Closed Bug 670885 (opened 8 years ago, closed 6 years ago)

TI+JM: jslinux slower with TI

Categories

(Core :: JavaScript Engine, defect)

Platform

(x86, macOS)

Tracking

(RESOLVED FIXED)

People

(Reporter: jandem, Assigned: jandem)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: perf)

Attachments

(4 files)

http://bellard.org/jslinux/ (note that the emulator prints the total time to boot to the "console")

Stock JM does very well here: it boots in about 4 seconds. After landing bug 663485 I get ~15 seconds with TI+JM. The patch made us a few seconds faster, but we still spend a lot of time in stubs::GetElem; we need to figure out why we don't inline it.
Debugged this a bit in gdb with Brian's help. There are multiple problems here; here's a test case for the first (and probably most important) one:
--
function f(a) {
    var x, y;
    if (a)
        x = new Uint8Array(50);
    var t0 = new Date;
    for (var i=0; i<10000000; i++) {
        y = x[1];
    }
    print(new Date - t0);
}
f(true);
--
882 ms with -m -n, 22 ms after commenting out line 3.

For the GETELEM x[1], the type of x is {void, Uint8Array}. To determine the typed array type we call getSingleObject, which returns NULL because there's also "void" in the type set, and therefore we don't inline the GETELEM. This shouldn't be too hard to fix; we can probably just ignore the primitive types, because there's already an "is-it-really-an-object" check if we're not sure.

Other problems:
2) The type of the key is sometimes unknown, but in this case we insert an is-int32 check so this shouldn't stop us from inlining the GETELEM.

3) There's a float/void type barrier. For an Int*Array GETELEM this causes us to always allocate a type register, but we should still be able to inline it.

4) Some functions have many locals (like > 100, many of them functions). SSA analysis has a 50 locals limit somewhere. Maybe related to 2.

Not sure how much 2-4 hurt perf though. 2 and 3 shouldn't cause us to be slower than JM-without-TI because it also has to check the key type and allocate a type register.
Attached patch "Patch" (Splinter Review)
Use getObject/getObjectCount instead of getSingleObject. The test case in comment 1 now runs in 26 ms. Still 22 ms without the "if". Also added a test to make sure we don't crash if the object becomes undefined (or another primitive type). The test crashes (of course) if I remove the is-object checks in getelem_typed or setelem_typed.
Assignee: general → jandemooij
Status: NEW → ASSIGNED
Attachment #545642 - Flags: review?(bhackett1024)
Comment on attachment 545642 [details] [diff] [review]
Patch

Review of attachment 545642 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/methodjit/FastOps.cpp
@@ +1455,4 @@
>              // Inline typed array path. Look at the proto to determine the typed array type.
> +            types::TypeObject *object = types->getObject(0);
> +            int atype = object->proto->getClass() - TypedArray::slowClasses;
> +            JS_ASSERT(atype >= 0 && atype < TypedArray::TYPE_MAX);

You should call addFreeze on types somewhere around here, I think at this point recompilation will only be triggered if the type tag or typed-arrayness of the object changes (a different type of typed array could come in without triggering a recompile).

FWIW I think this freeze constraint stuff is over-complicated, wastes memory and probably doesn't buy us much.  Will be simplifying pretty soon.

@@ +1975,4 @@
>              // Inline typed array path. Look at the proto to determine the typed array type.
> +            types::TypeObject *object = types->getObject(0);
> +            int atype = object->proto->getClass() - TypedArray::slowClasses;
> +            JS_ASSERT(atype >= 0 && atype < TypedArray::TYPE_MAX);

ditto
Attachment #545642 - Flags: review?(bhackett1024) → review+
(In reply to comment #3)
> You should call addFreeze on types somewhere around here, I think at this
> point recompilation will only be triggered if the type tag or
> typed-arrayness of the object changes (a different type of typed array could
> come in without triggering a recompile).

Hm, yes, that makes sense. It reminds me, by the way, that I still need to spend some time understanding what all the constraints do and when they're added. Pushed with the types->addFreeze(cx) calls added:

http://hg.mozilla.org/projects/jaegermonkey/rev/8e009db2de15
The patch improved this from 15 seconds to 11-12 seconds on 32-bit. Still a lot of time in stubs::GetElem and the compiler. I need to get a profiling build to get meaningful numbers, but for now I'll focus on the GETELEMs.
Iterate all objects in the type set to determine the typed array type. I also experimented with a new freeze-constraint to trigger recompilation only if an array of a different type is added. It worked fine but I didn't add it to this patch because it added complexity, didn't reduce the number of recompilations on jslinux and you mentioned that you were going to simplify the freeze constraints soon anyway.
Attachment #545905 - Flags: review?(bhackett1024)
Comment on attachment 545905 [details] [diff] [review]
Iterate all objects

Review of attachment 545905 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #545905 - Flags: review?(bhackett1024) → review+
Pushed: http://hg.mozilla.org/projects/jaegermonkey/rev/d7d452b4b90b

Down to 9.7 seconds, 2x slower than FF 5. Only a few calls to stubs::GetElem remain, so we are able to inline them now.

Next problem is reducing the number of recompilations. There's one really huge script of 32,754 bytes (probably about 10,000 ops). We recompile it 260 times. A lot of time in FrameState::sync etc.
(In reply to comment #8)
> We recompile it 260 times.

About a third of these are caused by recompiling for inlining: The script becomes hot and we recompile to inline calls. Then at some point we recompile again and reset the use count. Then the script becomes hot again, etc.

IIUC, reducing the number of recompilations *not* for inlining will also reduce the number of recompilations for inlining because we will reset the use count less often. But it might still be good to limit the recompilations-for-inlining somehow.
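The feedback loop described above can be sketched as a small simulation (all names and thresholds here are made up for illustration; they are not SpiderMonkey's):

```javascript
// Hypothetical model of the use-count feedback loop: a recompile for any
// non-inlining reason resets the use count, so the script can become "hot"
// and trigger a recompile-for-inlining all over again.
const USES_BEFORE_INLINING = 10000; // made-up stand-in for the real threshold

function countInliningRecompiles(totalUses, otherRecompilesAt) {
  let useCount = 0, inlined = false, inliningRecompiles = 0;
  for (let use = 1; use <= totalUses; use++) {
    if (otherRecompilesAt.has(use)) { // e.g. a type-barrier recompile...
      useCount = 0;                   // ...resets the use count
      inlined = false;                // ...and throws away the inlined code
    }
    useCount++;
    if (!inlined && useCount >= USES_BEFORE_INLINING) {
      inliningRecompiles++;           // hot again: recompile to inline calls
      inlined = true;
    }
  }
  return inliningRecompiles;
}
```

With no other recompiles we inline exactly once; every recompile that happens after the script got hot buys another recompile-for-inlining later.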
We probably shouldn't be inlining at all in scripts this large.  The metrics we use for determining when to compile and inline are static, and not affected by either the size of the script or the number of times we have previously compiled it.  They are currently 40 uses and 10000 uses, respectively, but I'm not sure how best to tune them (need to see more scripts that incur a big recompilation cost).  Probably a linear scaling in script length could work (quadratic seems like it would over-penalize big scripts), but it needs numbers for 'a' and 'b' in the 'a*x + b'.  To figure those out it would be cool if these constants (current and future ones) were configurable to make it easy to experiment with different tunings.  e.g. it would be nice to know if making the '40' bigger would improve this situation without needing to recompile the browser.
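A hedged sketch of the linear scaling idea; 'a' and 'b' are the unknown tuning constants from the comment, and the default values below are made up:

```javascript
// Longer scripts must run more often before we pay to (re)compile them.
// The constants are placeholders, purely for experimentation.
function compileThreshold(scriptLength, a = 0.1, b = 40) {
  return Math.ceil(a * scriptLength + b);
}

// The current flat threshold corresponds to a = 0, b = 40:
compileThreshold(32754, 0, 40); // 40 uses, even for the 32 KB jslinux script
compileThreshold(32754);        // 3316 uses with the made-up constants
```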

It would also be really super slick to be able to do partial compilation of scripts --- we see part of a script getting hot (a loop, or a subset of a loop), compile that part but don't compile the rest. Java does this, I think. We'd spend less time in the compiler so we could compile sooner, and we wouldn't trigger recompilation from type changes in code we didn't compile.

From an architecture standpoint, doing this in JM would be pretty simple --- we can already enter jitcode at loop heads from the interpreter, could make a C++ call to reenter the interpreter on leaving a compiled fragment, and more or less compile loop bodies independently from the rest of the script anyways.  I think the requirements on IonMonkey will be similar, not sure yet whether to experiment with this now or make sure we will be able to do this easily in IonMonkey.
(In reply to comment #10)
> From an architecture standpoint, doing this in JM would be pretty simple ---
> we can already enter jitcode at loop heads from the interpreter, could make
> a C++ call to reenter the interpreter on leaving a compiled fragment, and
> more or less compile loop bodies independently from the rest of the script
> anyways.  I think the requirements on IonMonkey will be similar, not sure
> yet whether to experiment with this now or make sure we will be able to do
> this easily in IonMonkey.

This should be easy (in theory) since it has a CFG. We could trim parts of the graph we don't want. Wherever you trim off a branch that could be taken, the removed jump target would change to a bailout.
Whoops, I accidentally answered the IRC thing instead of what you mentioned in this bug. I think compiling loop bodies independently should be possible too, once we have OSR - the bottom of the loop would be a bailout and the top of the loop would have an OSR entry point.
Most of the other recompilations are from stubs::TypeBarrierHelper for GETELEM ops. Two arrays that are used a lot are locals 2 and 20. Local 2 is a dense array:
--
void object[1] #27:15

#27:15 : Array:prototype packed dense {
    (index): [own] unknown
}
--
Local 20 is an Uint8Array:
--
#28:57 : Uint8Array:prototype typed {
    (index): [own] void int
}
--
For these arrays we recompile a lot because either:

1) the type is missing:
--
#155:23310:   9  getlocal 20
  type 0: void object[1] #28:57
#155:23313:   9  localinc 61
  type 0: unknown
#155:23316:   9  getelem
  typeset 1057: missing
  type 0: missing
  barrier: void int
#155:23319:   9  setlocal 9
  type 0: missing
--
After recompiling there's a void barrier:
--
#155:23310:   9  getlocal 20
  type 0: void object[1] #28:57
#155:23313:   9  localinc 61
  type 0: unknown
#155:23316:   9  getelem
  typeset 1057: int
  type 0: int
  barrier: void
#155:23319:   9  setlocal 9
  type 0: int
--
It looks like we repeat this process for many GETELEMs. The same happens for local 2, the dense array; the only difference is that the type there is "unknown" instead of "void or int".

For a typed array, reading out-of-bounds values adds undefined to the element type. It would be really nice if we could add undefined only to the GETELEM where we actually do the out-of-bounds read, and not penalize other GETELEMs. Otherwise, if the element type of a typed array is void/int, we should assume GETELEM reads an int and add a void barrier or something.

2) To make things worse, at some point doubles are added to local 2, the dense array. We have to recompile a number of times to update the type barriers for many GETELEM's.

I'd need to instrument the browser to get numbers but I'm sure that these two are responsible for a large number of compilations.
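The out-of-bounds behavior described above is easy to reproduce in the shell:

```javascript
// A read past the end of a typed array yields undefined, which is how
// "void" gets into the element type set next to "int".
var a = new Uint8Array(4);
a[0] = 255;

var inBounds = a[0];     // 255: an int
var outOfBounds = a[10]; // undefined: adds void to the observed types
```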
Oh btw, I will be mostly offline for a few weeks. So anybody, feel free to steal/close/clone this bug.
Given that typed array lengths are immutable, would it make any sense to include the length as part of the type information of the typed array or something?
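For reference, that invariant is easy to check in the shell: the length of a typed array really is fixed at construction, so it would be stable type information.

```javascript
// A typed array's length never changes after construction; writes to
// .length are ignored (or throw, in strict mode), never honored.
var a = new Uint8Array(8);

try { a.length = 100; } catch (e) { /* throws in strict mode */ }
// a.length is still 8
```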
The 'missing' observed type sets show up when the code has never actually run and produced a value, in which case we are compiling doomed code --- if it ever runs, it will trigger a recompilation.

I think we could be smarter here on typed arrays, since their behavior is so heavily constrained (can't have non-number elements, for example).  Currently the analysis doesn't treat typed arrays any differently from other arrays, they're just arrays of numbers, and for any property access the analysis is paranoid and doesn't want to put types in the observed set unless the code has actually run (the types might change by the time the code does run, and a different type entirely may be the only one observed at the site).

Would rather not do this, though, as this approach doesn't get at the root cause, which is that we shouldn't be compiling code which has never executed.  I kind of want to do partial compilation for JM+TI, as the architecture changes are minor and the main issues are organizational difficulties (where to keep compiled code for loops, how to bail into the interpreter without recompiling, how to trigger recompilation once a bailout gets hot enough) which apply equally to JM+TI and IonMonkey.
Attached file jslinux shell testcase
Here's a shell version of jslinux; I asked for and got permission from Fabrice to post it here. Just unzip and run like this:

$ a/b/js jslinux.js

I "beautified" all files using jsbeautifier.org. I also hacked the terminal code to use putstr so it prints its output nicely. Terminal input does not work, but we don't need it here.

Perf is comparable to the website, maybe a tiny bit faster because the website uses a 10 ms setTimeout and I had to emulate setTimeout using an array and a loop.

I get the following numbers on JM tip:

js -m       :  2.9
js -m -j -p :  3.0
js -m -n    :  9.0
js (-j)     : 28.5
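The setTimeout emulation mentioned above might look roughly like this sketch (hypothetical names, not the actual code in the attachment):

```javascript
// Minimal sketch of a setTimeout shim for a shell with no event loop:
// queue the callbacks and drain the queue in a loop after the main
// script has finished. Delays are ignored; scheduling order is kept.
var timerQueue = [];

function queueTimeout(fn, ms) { // stands in for the page's setTimeout
  timerQueue.push(fn);
}

function drainTimers() {
  // Callbacks may schedule more timers, so loop until the queue is empty.
  while (timerQueue.length > 0)
    timerQueue.shift()();
}
```

jslinux's boot loop then runs by calling drainTimers() once at the end of the script.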
Attached patch "WIP" (Splinter Review)
Patch to avoid compiling code that is known to be unreachable.  This works, and improves the '-m -n' time on my machine from 13s to 10s.  We are still recompiling the huge switch function about 170 times, however, and messing around with the thresholds triggering compilation doesn't help much.  May want a mechanism to compile large scripts one chunk at a time (for some definition of 'chunk') to avoid quadratic recompilation behavior in them.

I'm also not convinced that compilation is the biggest bottleneck here, but don't have much support for this belief, and can't get shark or callgrind or any other perf tool to give me usable data.
You can try just putting rdtsc in js::Compiler's constructor/destructor.
(In reply to Brian Hackett from comment #18)
> I'm also not convinced that compilation is the biggest bottleneck here, but
> don't have much support for this belief, and can't get shark or callgrind or
> any other perf tool to give me usable data.

Here's a small part of the call tree in Instruments (had to trim some lines):
---
5718.0ms   56.5%    js::mjit::TryCompile
5655.0ms   55.8%      js::mjit::Compiler::performCompilation
  48.5% (3.3%)          js::mjit::Compiler::generateMethod()
  11.0% (0.8%)            js::mjit::Compiler::inlineCallHelper
   7.8% (0.2%)            js::mjit::Compiler::jsop_getelem
   5.1% (0.6%)            js::mjit::Compiler::jsop_bitop(JSOp)
309.0ms    3.0%           js::mjit::FrameState::syncForBranch
274.0ms    2.7%           js::mjit::Compiler::jsop_localinc
253.0ms    2.5%           js::mjit::Compiler::startLoop
206.0ms    2.0%           js::mjit::Compiler::jsop_setelem
177.0ms    1.7%           js::mjit::Compiler::jsop_binary
---
And a list of functions sorted by self time:
--
Running (Self)  Symbol Name
590.0ms    5.8%	js::mjit::FrameState::sync
364.0ms    3.5%	js::mjit::Compiler::generateMethod()
255.0ms    2.5%	js::mjit::FrameState::syncForAllocation
242.0ms    2.3%	js::mjit::Compiler::finishThisUp
184.0ms    1.8%	js::mjit::LoopState::analyzeLoopBody
167.0ms    1.6%	js::mjit::StubCompiler::emitStubCall
162.0ms    1.6%	__bzero
159.0ms    1.5%	js_ValueToBoolean(js::Value const&)
155.0ms    1.5%	js::mjit::FrameState::syncAndKill
145.0ms    1.4%	JSC::X86Assembler::X86InstructionFormatter::memoryModRM
144.0ms    1.4%	js::mjit::stubs::FixupArity
142.0ms    1.4%	memmove$VARIANT$sse3x
141.0ms    1.3%	JSC::X86Assembler::X86InstructionFormatter::oneByteOp
--
If I'm reading this right we spend 56.5% under mjit::TryCompile. Not sure about the other 40%; most of it seems to be mjit code (although the mjit code confuses Instruments a bit).

Time under stub calls ("Focus on subtree" seems to be the only way to get this data in Instruments):

stubs::FixupArity:     144 ms
stubs::ValueToBoolean: 135 ms
stubs::LookupSwitch:    44 ms
stubs::Ursh:            27 ms
Depends on: 689284
This has interesting behavior with the patch in bug 706914.  With 1000 byte compilation chunks, compilation time goes way down (4662ms -> 128ms) but boot time goes way up (9.1s -> 13s).  Needs investigation.
Depends on: 706914
The latest patch in bug 706914 fixes some issues that were hurting this benchmark --- compilation wasn't getting triggered properly when jumping between compilation chunks, and the optimized handling of table switches was disabled.  The boot time with that patch is down to 3.5 seconds, which is still slower than plain JM but not grossly so.  I suspect that fixing bug 704387 will help this test a lot.
Depends on: 704387
Nightly 24.0a1 (2013-06-18) - Booted in 5.344s
Google Chrome 27.0.1453.110 m - Booted in 6.599s
Internet Explorer 10 - Booted in 15.081s

Yay!
Yep, the baseline compiler helped this get closer to JM-without-TI. I think we can close this now.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED