Last Comment Bug 706472 - IonMonkey: not inlining function call in 3bit-bits-in-byte
: IonMonkey: not inlining function call in 3bit-bits-in-byte
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal (vote)
: ---
Assigned To: Jan de Mooij [:jandem]
:
Mentors:
Depends on:
Blocks: IonSpeed 716682 686595
  Show dependency treegraph
 
Reported: 2011-11-30 07:26 PST by Jan de Mooij [:jandem]
Modified: 2012-01-09 14:48 PST (History)
9 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Testcase (451 bytes, application/x-javascript)
2011-11-30 07:26 PST, Jan de Mooij [:jandem]
no flags Details
WIP v0 (15.80 KB, patch)
2011-12-27 14:38 PST, Jan de Mooij [:jandem]
no flags Details | Diff | Review
WIP v0.2 (32.07 KB, patch)
2012-01-03 09:42 PST, Jan de Mooij [:jandem]
no flags Details | Diff | Review
Patch (37.59 KB, patch)
2012-01-04 12:16 PST, Jan de Mooij [:jandem]
dvander: review+
Details | Diff | Review
Patch v2 (45.54 KB, patch)
2012-01-09 03:05 PST, Jan de Mooij [:jandem]
marty.rosenberg: review+
Details | Diff | Review

Description Jan de Mooij [:jandem] 2011-11-30 07:26:10 PST
Created attachment 577951 [details]
Testcase

Attached is a stand-alone version of bitops-3bit-bits-in-byte.

We don't inline the call to fast3bitlookup because ranInference() in canEnterInlinedScript() returns false. A simple workaround is to change the inner loop:
--
for(var y=0; y<256; y++) {
    func(y);
}
--
to
--
for(var y=0; y<256; y++) {
    func(y);
    func(y)
}
--
The unmodified version runs in 32 ms, the second one in 1 ms.

The problem is that TimeFunc is compiled and analyzed right before fast3bitlookup, so we don't inline the call. JM+TI does not have this problem because it recompiles after ~10000 iterations to inline calls.
Comment 1 David Anderson [:dvander] 2011-11-30 07:41:27 PST
Is it too expensive to run inference if we've decided we want to inline something? How is the JM+TI counter implemented?
Comment 2 Brian Hackett (:bhackett) 2011-11-30 07:52:24 PST
(In reply to David Anderson [:dvander] from comment #1)
> Is it too expensive to run inference if we've decided we want to inline
> something? How is the JM+TI counter implemented?

Doing the analysis eagerly runs into problems because it can change type information which in turn affects previous inlining decisions.  This could be controlled for by rechecking that type information once all analysis is done, but instead JM+TI avoids this by only inlining calls once a script gets really hot.  Initial non-inlining versions of the code are compiled, and several thousand iterations/calls later the caller will be recompiled and inline applicable calls.  If at that point the callee still has not been analyzed then it isn't hot anyways.
Comment 3 David Anderson [:dvander] 2011-11-30 09:42:49 PST
Where does the counter get stored and incremented?
Comment 4 Brian Hackett (:bhackett) 2011-11-30 09:58:47 PST
(In reply to David Anderson [:dvander] from comment #3)
> Where does the counter get stored and incremented?

script->useCount is used both for triggering initial compilation and for inlining recompilation.  The counter gets bumped and tested at loop heads and at function entry.  The counter is not updated if the function doesn't have any candidates for inlining (the testing for this is stupid, it just greps for CALL opcodes I think).
Comment 5 Jan de Mooij [:jandem] 2011-12-27 14:38:28 PST
Created attachment 584491 [details] [diff] [review]
WIP v0

Hackish and incomplete first stab. Bumps the script use count at loop heads, bails out if it's >= some limit and invalidates the script.
Comment 6 Jan de Mooij [:jandem] 2011-12-28 06:07:39 PST
This seems to work well for the testcase in comment 0, but I had to work around some invalidation problem, I'll look into these first.
Comment 7 Jan de Mooij [:jandem] 2012-01-03 09:42:22 PST
Created attachment 585441 [details] [diff] [review]
WIP v0.2

Much more complete than the previous patch. x86 is mostly done, but code generation for x64 is a bit more complex - going to finish this tomorrow.
Comment 8 Jan de Mooij [:jandem] 2012-01-04 12:16:14 PST
Created attachment 585847 [details] [diff] [review]
Patch

Like JM, this bumps the use count at function entry and loop heads and if it's >= some value, we invalidate the script and return to the interpreter.

Also removes an invalid LSRA assert: an interval can have a use position at interval->end() and covers() returns false in this case. Since use positions were moved from the virtual register to the interval, I think we should just remove this assert.

Works on x86 and x64, but I will add ARM support before landing this.
Comment 9 Sean Stangl [:sstangl] 2012-01-04 12:23:42 PST
Arbitrarily invalidating scripts worries me because compilation times in IM are going to be significantly greater than in JM, scaling more than quadratically with the size of the script. Could we be smarter, and only try a recompilation if there is a known candidate for inlining that now has type information?
Comment 10 Sean Stangl [:sstangl] 2012-01-04 12:28:54 PST
Note also that if we return to the script in the interpreter, that can re-enter from an OSR entrypoint, and OSR implies deoptimization. So if we're recompiling for the sake of better optimization, we want to avoid OSR as much as possible.
Comment 11 David Anderson [:dvander] 2012-01-04 12:36:31 PST
Comment on attachment 585847 [details] [diff] [review]
Patch

Review of attachment 585847 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ion/TypeOracle.cpp
@@ +252,5 @@
>  
>  bool
> +TypeInferenceOracle::canInlineCalls()
> +{
> +    JS_ASSERT(script->getUseCount() < js_IonOptions.usesBeforeInlining);

What prevents this function from being called if useCount > usesBeforeInlining?

::: js/src/ion/x86/CodeGenerator-x86.cpp
@@ +401,5 @@
> +bool
> +CodeGeneratorX86::visitRecompileCheck(LRecompileCheck *lir)
> +{
> +    Operand addr(gen->info().script()->addressOfUseCount());
> +    masm.addl(Imm32(1), addr);

This (and the x64 equivalent) will break with moving GC. On x86 you can use the [address + offset] form, and on x64 you can use ScratchReg (to get the script base address into a register and add a relocation marker).
Comment 12 David Anderson [:dvander] 2012-01-04 12:43:21 PST
(In reply to Sean Stangl from comment #9)
> Arbitrarily invalidating scripts worries me because compilation times in IM
> are going to be significantly greater than in JM, scaling more than
> quadratically with the size of the script. Could we be smarter, and only try
> a recompilation if there is a known candidate for inlining that now has type
> information?

Brian suggested this in comment #4, it sounds like good follow-up work.

(In reply to Sean Stangl from comment #10)
> Note also that if we return to the script in the interpreter, that can
> re-enter from an OSR entrypoint, and OSR implies deoptimization. So if we're
> recompiling for the sake of better optimization, we want to avoid OSR as
> much as possible.

It's not yet clear how much OSR impacts performance, I would be surprised if it was significant, but we should be able to measure.
Comment 13 Jan de Mooij [:jandem] 2012-01-04 12:59:49 PST
(In reply to Sean Stangl from comment #9)
> Could we be smarter, and only try
> a recompilation if there is a known candidate for inlining that now has type
> information?

Currently we recompile only if the script has calls. Scanning the bytecode for inlineable calls seems resaonable, but one problem is that if we cannot inline, we'd still have the recompile checks in the script (and it'd have to check for == instead of >= to avoid bailing out on every iteration).
Comment 14 Brian Hackett (:bhackett) 2012-01-04 16:11:21 PST
(In reply to David Anderson [:dvander] from comment #11)
> ::: js/src/ion/x86/CodeGenerator-x86.cpp
> @@ +401,5 @@
> > +bool
> > +CodeGeneratorX86::visitRecompileCheck(LRecompileCheck *lir)
> > +{
> > +    Operand addr(gen->info().script()->addressOfUseCount());
> > +    masm.addl(Imm32(1), addr);
> 
> This (and the x64 equivalent) will break with moving GC. On x86 you can use
> the [address + offset] form, and on x64 you can use ScratchReg (to get the
> script base address into a register and add a relocation marker).

It should be safe to treat JSScripts as non-movable within the compiler.  They will not be nursery allocated by a generational GC, and all jitcode will be purged before doing a compacting GC.  The same holds for Shapes.  I believe that it will be possible to never need to worry about moving pointers embedded in jitcode, which should (I think?) avoid the need to distinguish ImmGCPtr from ImmWord.  Object-related pointers could only be embedded if the object was not nursery allocated, but that's not a big deal.  Embedding object pointers is rare, and would only reasonably happen for objects with singleton types (globals, singleton functions, prototypes, etc.), which could always be allocated in the major heap.
Comment 15 David Anderson [:dvander] 2012-01-04 17:46:34 PST
> I believe that it will be possible to never need to worry about moving pointers embedded in
> jitcode, which should (I think?) avoid the need to distinguish ImmGCPtr from ImmWord.

I'd prefer to keep the distinction so we can support potential changes in the future without much work. I also prefer using the relocatable functions where we can, even if the types are immovable at the moment (it's very easy and shouldn't affect performance). We can #if 0 writing the actual relocation tables if they're taking up memory for no reason.
Comment 16 Jan de Mooij [:jandem] 2012-01-09 03:05:17 PST
Created attachment 586941 [details] [diff] [review]
Patch v2

Addresses review comments. We decided to bake in the script pointer though, it should be safe per comment 14 and avoids a temp register on x86.

Marty, can you review the ARM changes?
Comment 17 Jan de Mooij [:jandem] 2012-01-09 03:09:54 PST
(In reply to David Anderson [:dvander] from comment #11)
>
> What prevents this function from being called if useCount >
> usesBeforeInlining?

Yes, we don't really care about the order in which we check these conditions, so I removed the assert.
Comment 18 Marty Rosenberg [:mjrosenb] 2012-01-09 12:14:40 PST
Comment on attachment 586941 [details] [diff] [review]
Patch v2

Review of attachment 586941 [details] [diff] [review]:
-----------------------------------------------------------------

I've only reviewed the ARM bits of this patch.  Let me know if you had actually wanted me to review any other parts of it.

::: js/src/ion/arm/CodeGenerator-arm.cpp
@@ +1314,5 @@
> +    // since scripts are never nursery allocated and jitcode will be purged before
> +    // doing a compacting GC.
> +    masm.load32(ImmWord(addr), tmp);
> +    masm.ma_add(Imm32(1), tmp);
> +    masm.store32(tmp, ImmWord(addr));

I am a bit worried about having ImmWord(addr) in here twice, since this will result in addr being moved into ScratchRegister twice (likely requiring two instructions each).
Comment 19 Jan de Mooij [:jandem] 2012-01-09 14:46:07 PST
(In reply to Marty Rosenberg [:Marty] from comment #18)
> I am a bit worried about having ImmWord(addr) in here twice, since this will
> result in addr being moved into ScratchRegister twice (likely requiring two
> instructions each).

We talked about this on IRC. A possible fix is to add "add32(Imm32, ImmWord, reg)" to the macro assembler, but Marty is working on a more general fix for this problem by tracking the values stored in the scratch register. It should get rid of the second "move ImmWord into ScratchReg".

Filed a follow-up bug (bug 706472) to avoid unnecessary recompilations.

http://hg.mozilla.org/projects/ionmonkey/rev/90bb9afe9b90
Comment 20 Jan de Mooij [:jandem] 2012-01-09 14:48:01 PST
(In reply to Jan de Mooij (:jandem) from comment #19)
> Filed a follow-up bug (bug 706472)

Embarassing, bug 716682 of course.

Note You need to log in before you can comment on or make changes to this bug.