Closed Bug 700108 Opened 12 years ago Closed 12 years ago

IonMonkey: Implement OSR

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: sstangl, Unassigned)

References

Details

Attachments

(2 files, 2 obsolete files)

      No description provided.
Depends on: 700111
Depends on: 700211
Attached patch Implement OSR, v1. (obsolete) — Splinter Review
The attached patch implements On-Stack Replacement for x86 and x86_64. OSR has been through several redesigns; the patch attempts to keep some changes from previous designs that would avoid some pitfalls in the future.

Design:

OSR takes advantage of LSnapshots to reason about JIT state, since LSnapshots, although existing for the opposite purpose, already encode all the information necessary to reconstruct the JIT environment given a pointer to an interpreter StackFrame. This idea was known from the beginning -- the redesigns mainly focused on how to represent an OSR entry in the MIR.

Naively, as occurred in the first design, one can think of an OSR entry as being a separate block, with no predecessor, having the target loop header as its successor. Since OSR may change values, we can simply insert additional MPhi nodes to handle the join points, and those can take some no-op MOsrValue instructions as input. This sounds appealing, but 1) Phi placement is incredibly obnoxious, and 2) Our code _really_ does not like the idea of there being a second block with no predecessor, to the extent that effectively every phase requires changes. So we don't do that.

Instead, OSR triggers insertion of a "loop preheader" block, inserted before the loop header, replacing every active slot in the MBasicBlock with an MOsrPhi no-op instruction that takes a Value input and has a Value output. This forces materialization of the Value before the OSR entry point, represented by an MOsrEntry instruction; and since we replaced the slots themselves, the MOsrPhi instructions are automatically used in place of the Values they mask. Codegen for the LOsrEntry then occurs as a separate phase in the CodeGenerator, so that no actual code exists on the inline path.

Other:

The code for generateOsrEntry() has been made platform-specific mostly because it is _significantly_ prettier that way. It was originally written with #ifdefs all over the place in CodeGenerator-x86-shared.cpp, then prettified.

TODO:

Note that this OSR patch causes about 40 additional test failures in jit-tests. A good number of these are bugs with TI integration with IonMonkey in general, since OSR provides the unique opportunity to compile a function with type information mostly lacking after a loop boundary.

Recompilation heuristics are sorely missing -- OSR code is actually known-incorrect without recompilation, since it is possible for us to re-enter a function after a type guard has tripped. We have resolved to ignore the issue for the time being, even though it causes known test failures. A potential stopgap is for us to present an LSnapshot through to the OSR entry function, which can check for type mismatches before entering JIT code.

LICM has also been disabled if compilation includes an OSR entry: in the case of nested loops, with the MOsrEntry on the inner loop's preheader, LICM may hoist to the outer loop's header. Logic to fix this well is a bit more complicated than I would like to include in this patch, so I just left it to future work.

ARM support, which should look similar to the x86_64 code, since ARM has a ScratchRegister. We need this before committing. I will look into it in short order.
Attachment #573130 - Flags: review?(dvander)
Thanks for the writeup - reviewing now, but some questions in the interim:

> Note that this OSR patch causes about 40 additional test failures in
> jit-tests. A good number of these are bugs with TI integration with
> IonMonkey in general, since OSR provides the unique opportunity to compile a
> function with type information mostly lacking after a loop boundary.

Are these failures all with -n, and all NYIs? If not, there might be real bugs.

> Recompilation heuristics are sorely missing -- OSR code is actually
> known-incorrect without recompilation

We should have a follow-up dependent bug to address this, until we get OSI. It can be super simple, like: Bailout() sets a "never run" flag on an IonScript, that bans OSR.

> LICM has also been disabled if compilation includes an OSR entry: in the
> case of nested loops, with the MOsrEntry on the inner loop's preheader, LICM
> may hoist to the outer loop's header. Logic to fix this well is a bit more
> complicated than I would like to include in this patch, so I just left it to
> future work.

If I'm reading this right, LICM is entirely disabled in the presence of an OSR entry? That seems fine to land but we should definitely fix this soon - can be a follow-up bug though.
(In reply to David Anderson [:dvander] from comment #2)
> Thanks for the writeup - reviewing now, but some questions in the interim:
> 
> > Note that this OSR patch causes about 40 additional test failures in
> > jit-tests. A good number of these are bugs with TI integration with
> > IonMonkey in general, since OSR provides the unique opportunity to compile a
> > function with type information mostly lacking after a loop boundary.
> 
> Are these failures all with -n, and all NYIs? If not, there might be real
> bugs.

There are a few failures without -n; the vast majority require TI. They seemed to fail as a result of stack explosion (OSR -> bailout -> OSR -> bailout -> ...). I suspect that the others can be caught by the "never run" flag.

> > Recompilation heuristics are sorely missing -- OSR code is actually
> > known-incorrect without recompilation
> 
> We should have a follow-up dependent bug to address this, until we get OSI.
> It can be super simple, like: Bailout() sets a "never run" flag on an
> IonScript, that bans OSR.

This is a good idea.

> > LICM has also been disabled if compilation includes an OSR entry: in the
> > case of nested loops, with the MOsrEntry on the inner loop's preheader, LICM
> > may hoist to the outer loop's header. Logic to fix this well is a bit more
> > complicated than I would like to include in this patch, so I just left it to
> > future work.
> 
> If I'm reading this right, LICM is entirely disabled in the presence of an
> OSR entry? That seems fine to land but we should definitely fix this soon -
> can be a follow-up bug though.

Yes and yes. OSR will mostly work with LICM enabled, except in the case of nested loops with an OSR preheader on an inner loop.
Comment on attachment 573130 [details] [diff] [review]
Implement OSR, v1.

Review of attachment 573130 [details] [diff] [review]:
-----------------------------------------------------------------

Great work, I'm really glad that the OsrEntry and OsrPhi ideas worked out. This nicely takes advantage of the existing compiler guts. Leaving r? since I have a few comments/questions below.

One thought on the CopyOsrSlot complexities is that maybe this is something the MoveResolver could solve later on.

::: js/src/ion/CodeGenerator.cpp
@@ +248,5 @@
> +bool
> +CodeGenerator::visitOsrEntry(LOsrEntry *lir)
> +{
> +    lir->setFrameDepth(masm.framePushed());
> +    encode(lir->snapshot()); // TODO: Necessary?

Nope, not necessary, unless the snapshotOffset will be used somewhere. I'd put it in the DEBUG below so you get the encoding spew, which is useful.

::: js/src/ion/Ion.cpp
@@ +703,5 @@
> +    JSScript *script = fp->script();
> +    IonScript *ion = script->ion;
> +
> +    if (ion->osrPc() != pc)
> +        return false;

Can't return false here since there's no error. This sounds like something heuristics related, so maybe it should be hoisted out of ion::SideCannon().

::: js/src/ion/Ion.h
@@ +136,2 @@
>  bool Cannon(JSContext *cx, StackFrame *fp);
> +bool SideCannon(JSContext *cx, StackFrame *fp, jsbytecode *pc);

Nice.

::: js/src/ion/IonAnalysis.cpp
@@ +43,4 @@
>  #include "MIRGraph.h"
>  #include "Ion.h"
>  #include "IonAnalysis.h"
> +#include "jsnum.h"

Needed?

@@ +879,5 @@
>      return true;
>  }
>  
> +void
> +ion::AssertGraphCoherency(MIRGraph &graph)

Nice.

::: js/src/ion/IonCode.h
@@ +67,4 @@
>    protected:
>      uint8 *code_;
>      JSC::ExecutablePool *pool_;
> +    uint32 osrEntryOffset_;         // Offset to OSR entrypoint from |code|, or 0 if none.

I think this belongs in IonScript. IonCode is kind of a weird place.

::: js/src/ion/MIRGraph.h
@@ +142,5 @@
>      // operations, as it will not create new SSA names for copies.
>      void initSlot(uint32 index, MDefinition *ins);
>  
> +    // Replace all slots with MOsrPhi instructions, adding them to the block.
> +    void makeSlotsOsrPhis();

nit: "makeOsrPhis" or "insertOsrPhis" might read easier

::: js/src/ion/ValueNumbering.cpp
@@ +350,5 @@
>                  continue;
>              }
>  
> +            // GVN must preserve reloading from OsrPhi sites.
> +            if (info_.withOsr() && (!ins->isIdempotent() || ins->usesOsrPhi())) {

Talked in IRL: Instead, let's just make OsrPhis always be incongruent to everything. Then we can get rid of usesOsrPhi as well as all of the ValueNumbering changes.

::: js/src/ion/arm/Lowering-arm.cpp
@@ +74,4 @@
>  bool
>  LIRGeneratorARM::visitConstant(MConstant *ins)
>  {
> +    if (ins->canEmitAtUses() && ins->type() != MIRType_Double)

FWIW, I like this - "canEmitAtUses" reads a lot better than "isEmittedAtUses".

::: js/src/ion/x64/CodeGenerator-x64.cpp
@@ +211,5 @@
> +          case MIRType_Int32:   masm.unboxInt32(source, payloadReg);   break;
> +          case MIRType_String:  masm.unboxString(source, payloadReg);  break;
> +          case MIRType_Object:  masm.unboxObject(source, payloadReg);  break;
> +          case MIRType_Boolean: masm.unboxBoolean(source, payloadReg); break;
> +          case MIRType_Value:   masm.movq(source, payloadReg);         break;

Nit: statement and breaks on newlines. It's a little more space-iferous, but much easier to step through.

::: js/src/ion/x64/MacroAssembler-x64.h
@@ +375,5 @@
> +        movq(ImmWord(JSVAL_PAYLOAD_MASK), dest);
> +        andq(src.valueReg(), dest);
> +    }
> +    void unboxObject(const Operand &src, const Register &dest) {
> +        // Explicitly permits |dest| to be used in |src|.

Probably these functions should assert dest != ScratchReg... though they didn't before.

::: js/src/ion/x64/Trampoline-x64.cpp
@@ +194,5 @@
>      GenerateReturn(masm, JS_TRUE);
>  
>      Linker linker(masm);
> +    IonCode *enterJIT = linker.newCode(cx);
> +    enterJIT->setOsrEntryOffset(noPrologueOffset);

This is kind of hacky. If we move the OSR offset to IonScript, we can just allocate two trampolines - the second can even jump to the first.

::: js/src/ion/x86/Assembler-x86.h
@@ +72,4 @@
>  
>  static const Register ArgumentsRectifierReg = { JSC::X86Registers::esi };
>  
> +static const Register OsrFpReg = { JSC::X86Registers::edx };

Here "Fp" made me think "Floating-point" - maybe Frame would be better.

::: js/src/ion/x86/CodeGenerator-x86.cpp
@@ +235,5 @@
> +    if (!okClobber && type->isRegister() && ToRegister(type) == OsrScratchReg)
> +        return CopyOsrSlot_ConflictsScratch;
> +
> +    // Load payload.
> +    if (payload->isRegister()) {

isRegister includes is isFloatReg, so this should be isGeneralReg()

@@ +237,5 @@
> +
> +    // Load payload.
> +    if (payload->isRegister()) {
> +        masm.movl(payloadSource, ToRegister(payload));
> +    } else if (payload->isFloatReg()) {

Possible to pull this case out above for early return?

@@ +252,5 @@
> +    }
> +
> +    // Load type.
> +    if (type->isConstant() || type->isArgument()) {
> +        // Nothing to do.

How could the type ever be constant?
Attached patch Implement OSR, v2. (obsolete) — Splinter Review
This takes a different approach than in the previous OSR patch: instead of hijacking LSnapshots to determine JIT state, values passed from the interpreter to the JIT are now explicit in the MIR as MOsrValues, with OSR control flow resolved by MPhi instructions. The problems previously encountered with creating a dedicated MBasicBlock for OSR have been solved. The resultant code is significantly simpler and almost entirely architecture-agnostic.

The previous approach worked fine in the vast majority of cases. Unfortunately, it bore the assumption that each value is stored in a canonical location, either in a register or in memory -- so if a value is loaded from a backing store into a register, that backing store is forgotten. This actually holds true nearly all the time, except in the case of the Greedy allocator in the case of nested loops: although a value may be a register, Greedy can re-load from the backing store. In any case, the assumption is not only wrong, but detrimental and unnecessary in retrospect.

The implementation with an independent root OSR block therefore resumed looking very appealing. Changing the code to support this was actually very simple, contrary to prior experience -- it only took about a day, except for GVN and LICM changes. The previous issues encountered were caused by incorrectly setting predecessor information (caught now by AssertGraphConsistency(), part of this patch) and incorrectly setting MBasicBlock slots (now overwritten on block initialization).

Instructions in the OSR block are laid out in the following order:

> MOsrEntry
> MOsrValue
> MOsrValue
>  ...
> MOsrValue
> MStart
> {Unboxes and MoveGroups}

The MOsrEntry sets the StackPointer and remembers masm.size() to calculate the offset into the code buffer for the OSR entrypoint. In the MIR, it defines an Object value in the fixed register OsrFrameReg.

Each MOsrValue takes the MOsrEntry as an operand, so the OsrFrameReg does not have to be hardcoded between MOsrValues, so clobbering it is no longer a problem -- a MoveGroup will be issued, the conflict resolved explicitly in the LIR. The MOsrValue knows the source offset from the frame register, and simply loads it into the register(s) provided by regalloc.

Fallible operations may only occur after the MStart, to which a new MResumePoint is bound (and backported to each MOsrValue).

Additionally, creation of an OSR block triggers the creation of a loop preheader, into which MPhi nodes are automatically inserted to explicitly resolve OSR control flow. The successor of the OSR block is this preheader.

An interesting thing to observe is that with an OSR block, the graph now contains two roots, the default entry block and the OSR block, neither of which has a predecessor. This has implications for GVN, which relies on a dominance tree: there is no longer a tree covering every block, but rather a forest (a graph with multiple disconnected trees). For example, the loop preheader with an OSR predecessor has *no* dominator definitionally: it is reachable from independent roots along control flow with no common intermediary block. Therefore it is the root of its own dominance tree, and so on with other blocks.

To express the concept of "no strict dominator", blocks can now be designated as "self-dominating" by setting a block's immediateDominator to itself. This mimics the existing code, which uses NULL to represent 'unvisited', and sets the default root's immediateDominator to itself (but skips over the first block in most of the logic anyway). GVN then iterates over each tree, only considering value numbers within the tree.

LICM also must make use of dominance information to determine whether hoisting is safe: if a definition is hoisted above a self-dominating block, either the hoist must not be performed, or multiple definitions must be created to satisfy each control path (with an MPhi created in the self-dominating block, and uses replaced by the MPhi). This would require iterating through the loop for each instruction, or remembering the list of all self-dominating blocks, which is a somewhat invasive change. So for now we just ignore all loops with self-dominating blocks. In other words, hoisting can still occur, unless that hoist would cross the OSR block, which is plenty good for the time being.

ARM just needs generateOsrPrologue() written, which will be a scant few lines. I don't have an ARM machine with me at the moment, and it doesn't affect the patch design anyway, so I am leaving it until I trek back to the office and have something to actually test it with.

Everything else is complete. All tests succeed with the exception of NYI assertions.
Attachment #573130 - Attachment is obsolete: true
Attachment #573130 - Flags: review?(dvander)
Attachment #576691 - Flags: review?(dvander)
Attached image Example MIR with OSR.
Code for the iongraph output in the attachment:

> function f() {
> 	var x = 0;
> 
> 	for (var i = 0; i < 1000; i++) {
> 		x = x + 1;
> 		for (var j = 0; j < 7; j++) {
> 			x = x + 1;
> 		}
> 	}
> 
> 	for (var k = 0; k < 7; k++) {
> 		x = x + 1;
> 	}
> 	return x;
> }
> 
> print(f()); // f() isn't jitted.

The graph output is taken after LICM, just before lowering.
(In reply to Sean Stangl from comment #5)
> ARM just needs generateOsrPrologue() written, which will be a scant few lines.

I take that back. ARM only has 4 volatile registers, each of which is consumed to pass an argument. But the current scheme would need 2 additional volatile registers -- one for the 6th argument, one for branching to generateEnterJIT() since ARM cannot branch without clobbering a register beforehand for immediate loading.

So ARM just can't branch at all in this context, so generateOsrPrologue() can't exist on ARM. Here are a couple of possibilities to solve this.

1. On ARM, generateEnterJIT() could always perform a load from the location where a fifth argument would be. Non-volatile registers are already saved at this point and may be freely clobbered. If no fifth argument is passed, a gibberish load is performed, but it is safe and the value is ignored.

Disadvantages are ARM-specific code in SideCannon and IonCompartment, and an unnecessary load on non-OSR Ion transitions.

2. generateOsrPrologue() could be eliminated for all platforms. Instead, code generated for the LOsrEntry could look through the stacks and load the OsrFrameReg from the position it was saved. This is complicated by dynamic alignment.

Disadvantages are a non-generic visitOsrEntry() in the CodeGenerator, LOsrEntry codegen sneakily depending on subtleties in generateEnterJIT(), and an overly complicated OSR path.

The one advantage of this scheme is that on non-Windows x86_64 systems, since the 6th argument is passed in a register, nothing needs to be done at all, and we save a branch. But we could also save this branch in the first solution's implementation by treating non-Windows x86_64 like ARM, with no special generateOsrPrologue().

The first solution sounds better to me. An extra load on the main transition isn't so terrible.
This puts the OSR loads into the inline path for ARM. IonCompartment's interfaces are the same on every architecture for sanity. ARM-specific logic is therefore just in the IonCompartment mark() and sweep() functions. The changes are minimal anyway.

The attached patch compiles for ARM, but OSR cannot yet run, since everything I tried trips various NYI aborts.
Attachment #576691 - Attachment is obsolete: true
Attachment #576691 - Flags: review?(dvander)
Attachment #577029 - Flags: review?(dvander)
Comment on attachment 577029 [details] [diff] [review]
Implement OSR, v3.

Review of attachment 577029 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great! This approach is a lot easier to understand, in retrospect, and the changes to the rest of the pipeline weren't too invasive. Just a few nits, and some random questions.

::: js/src/ion/Ion.cpp
@@ +182,5 @@
> +    // osrPrologue_ is the same as enterJIT_.
> +#else
> +    if (osrPrologue_)
> +        MarkRoot(trc, osrPrologue_.unsafeGet(), "osrPrologue");
> +#endif

Rather than have the architecture-specific #ifdef here, can we just mark it twice?

@@ +202,5 @@
>  {
>      if (enterJIT_ && IsAboutToBeFinalized(cx, enterJIT_))
>          enterJIT_ = NULL;
> +#ifdef JS_CPU_ARM
> +    // osrPrologue_ is the same as enterJIT_.

Same.

@@ +659,5 @@
> +
> +    // Ignore OSR if the code is expected to result in a bailout.
> +    if (script->ion && script->ion != ION_DISABLED_SCRIPT &&
> +        script->ion->isOsrForbidden())
> +        return Method_Skipped;

nit: brace this if

@@ +752,5 @@
>      return !result.isMagic();
>  }
>  
> +bool
> +ion::SideCannon(JSContext *cx, StackFrame *fp, jsbytecode *pc)

This function is mostly identical to ion::Cannon. Is there any way to share the bulk of it, like we did with JM?

::: js/src/ion/IonBuilder.cpp
@@ +2270,5 @@
> +        osrBlock->initSlot(slot, osrv);
> +    }
> +
> +    // Initialize stack.
> +    uint32 numSlots = preheader->stackDepth() - (info().nargs() + 1) - info().nlocals();

nit: CountFunArgs(info.fun()) instead of nargs() + 1

::: js/src/ion/IonCode.h
@@ +67,4 @@
>    protected:
>      uint8 *code_;
>      JSC::ExecutablePool *pool_;
> +    uint32 osrEntryOffset_;         // Offset to OSR entrypoint from |code|, or 0 if none.

Non-nit: this should be a member of IonScript, since non-methods don't participate in OSR.

@@ +213,5 @@
> +    }
> +    void forbidOsr() {
> +        forbidOsr_ = true;
> +    }
> +    bool isOsrForbidden() {

nit: const (and osrPc() too)

::: js/src/ion/LICM.cpp
@@ +64,5 @@
>      for (ReversePostorderIterator i(graph.rpoBegin()); i != graph.rpoEnd(); i++) {
>          MBasicBlock *header = *i;
> +
> +        // Skip non-headers and self-loops.
> +        if (!header->isLoopHeader() || header->numPredecessors() < 2)

Interesting, can we even have self-loops? It seems like we always want a header block.

::: js/src/ion/Lowering.cpp
@@ +440,5 @@
>  bool
> +LIRGenerator::visitOsrEntry(MOsrEntry *entry)
> +{
> +    LOsrEntry *lir = new LOsrEntry;
> +    return defineFixed(lir, entry, LAllocation(AnyRegister(OsrFrameReg)));

Nice, I think this is the right way to go, because now the allocator has to do all the hard work.

::: js/src/ion/MIR.h
@@ +507,5 @@
> +class MOsrEntry : public MAryInstruction<0>
> +{
> +  protected:
> +    MOsrEntry() {
> +        setResultType(MIRType_Object); // &StackFrame, in OsrFrameReg.

This should get its own private MIRType (you can add MIRType_Pointer or MIRType_StackFrame or something) - otherwise it could be tracked as a gcthing in the RA.

::: js/src/ion/arm/CodeGenerator-arm.cpp
@@ +944,5 @@
> +
> +    const ptrdiff_t frameOffset = value->mir()->frameOffset();
> +
> +    masm.load32(Address(ToRegister(frame), frameOffset),  ToRegister(payload));
> +    masm.load32(Address(ToRegister(frame), frameOffset + 4), ToRegister(type));

x86 has constants like NUNBOX32_TYPE_OFFSET, do those exist on ARM too?

::: js/src/ion/shared/CodeGenerator-shared.h
@@ +90,5 @@
> +    inline void setOsrEntryOffset(size_t offset) {
> +        JS_ASSERT(osrEntryOffset_ == 0);
> +        osrEntryOffset_ = offset;
> +    }
> +    inline size_t getOsrEntryOffset() {

nit: const

::: js/src/ion/x86/CodeGenerator-x86.cpp
@@ +113,5 @@
> +
> +    const ptrdiff_t frameOffset = value->mir()->frameOffset();
> +
> +    masm.movl(Operand(ToRegister(frame), frameOffset),  ToRegister(payload));
> +    masm.movl(Operand(ToRegister(frame), frameOffset + 4), ToRegister(type));

nit: NUNBOX32_TYPE_OFFSET, NUNBOX32_PAYLOAD_OFFSET

::: js/src/jsinterp.cpp
@@ +1959,5 @@
>          DO_OP();
>  
> +#ifdef JS_ION
> +    // This always tries to perform OSR, which can lead to stack explosion
> +    // in case bailouts are frequently taken. We will eventually need

What is the case that leads to stack explosion?
Attachment #577029 - Flags: review?(dvander) → review+
(In reply to David Anderson [:dvander] from comment #9)
> Rather than have the architecture-specific #ifdef here, can we just mark it
> twice?

Appears so.

> This function is mostly identical to ion::Cannon. Is there any way to share
> the bulk of it, like we did with JM?

Yep. It's not quite as pretty, though, since we have multiple entrypoints based on whether or not a sixth argument is pushed. That could be changed if we don't mind the extra write. (And if you don't mind an extra load on each function call, the osrPrologue could go away entirely!).

> Interesting, can we even have self-loops? It seems like we always want a
> header block.

I didn't think so, but the LICM code that already existed seemed to suggest that. Also note that backedge() explictly permits numPredecessors() == 1, which is hopefully a legacy thing. I'm fairly confident that we never generate self-loops, but it seems that we may want to fix more than just LICM, so I'll leave things as they were for a future patch to change.

> What is the case that leads to stack explosion?

Whoops -- that comment is outdated. No more explosions.
http://hg.mozilla.org/projects/ionmonkey/rev/9fb668f0baca
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
http://hg.mozilla.org/projects/ionmonkey/rev/30331f149265
Fixes red on opt builds -- numBlocks() is debug-only.
You need to log in before you can comment on or make changes to this bug.