Closed Bug 670885 (opened 8 years ago, closed 6 years ago)

TI+JM: jslinux slower with TI

Categories

(Core :: JavaScript Engine, defect)

Platform

(x86, macOS)

Tracking

(RESOLVED FIXED)

People

(Reporter: jandem, Assigned: jandem)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: perf)

Attachments

(4 files)

http://bellard.org/jslinux/ (note that the emulator prints the total time to boot to the "console")

Stock JM does very well here: it boots in about 4 seconds. After landing bug 663485 I get ~15 seconds with TI+JM. The patch made us a few seconds faster, but we still spend a lot of time in stubs::GetElem; we need to figure out why we don't inline it.
Debugged this a bit in gdb with Brian's help. There are multiple problems here; here's a test case for the first (and probably most important) one:
--
function f(a) {
    var x, y;
    if (a)
        x = new Uint8Array(50);
    var t0 = new Date;
    for (var i=0; i<10000000; i++) {
        y = x[1];
    }
    print(new Date - t0);
}
f(true);
--
882 ms with -m -n, 22 ms after commenting out line 3.

For the GETELEM x[1], the type of x is {void, Uint8Array}. To determine the typed array type we call getSingleObject, which returns NULL because there's also "void" in the type set, and therefore we don't inline the GETELEM. This shouldn't be too hard to fix; we can probably just ignore the primitive types, because there's already an "is-it-really-an-object" check if we're not sure.

Other problems:
2) The type of the key is sometimes unknown, but in this case we insert an is-int32 check so this shouldn't stop us from inlining the GETELEM.

3) There's a float/void type barrier. For an Int*Array GETELEM this causes us to always allocate a type register, but we should still be able to inline it.

4) Some functions have many locals (like > 100, many of them functions). SSA analysis has a 50 locals limit somewhere. Maybe related to 2.

Not sure how much 2-4 hurt perf though. 2 and 3 shouldn't cause us to be slower than JM-without-TI because it also has to check the key type and allocate a type register.
Attached patch "Patch" (Splinter Review)
Use getObject/getObjectCount instead of getSingleObject. The test case in comment 1 now runs in 26 ms. Still 22 ms without the "if". Also added a test to make sure we don't crash if the object becomes undefined (or another primitive type). The test crashes (of course) if I remove the is-object checks in getelem_typed or setelem_typed.
Assignee: general → jandemooij
Status: NEW → ASSIGNED
Attachment #545642 - Flags: review?(bhackett1024)
Comment on attachment 545642 [details] [diff] [review]
Patch

Review of attachment 545642 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/methodjit/FastOps.cpp
@@ +1455,4 @@
>              // Inline typed array path. Look at the proto to determine the typed array type.
> +            types::TypeObject *object = types->getObject(0);
> +            int atype = object->proto->getClass() - TypedArray::slowClasses;
> +            JS_ASSERT(atype >= 0 && atype < TypedArray::TYPE_MAX);

You should call addFreeze on types somewhere around here, I think at this point recompilation will only be triggered if the type tag or typed-arrayness of the object changes (a different type of typed array could come in without triggering a recompile).

FWIW I think this freeze constraint stuff is over-complicated, wastes memory and probably doesn't buy us much.  Will be simplifying pretty soon.

@@ +1975,4 @@
>              // Inline typed array path. Look at the proto to determine the typed array type.
> +            types::TypeObject *object = types->getObject(0);
> +            int atype = object->proto->getClass() - TypedArray::slowClasses;
> +            JS_ASSERT(atype >= 0 && atype < TypedArray::TYPE_MAX);

ditto
Attachment #545642 - Flags: review?(bhackett1024) → review+
(In reply to comment #3)
> You should call addFreeze on types somewhere around here, I think at this
> point recompilation will only be triggered if the type tag or
> typed-arrayness of the object changes (a different type of typed array could
> come in without triggering a recompile).

Hm, yes, that makes sense. It reminds me, by the way, that I still need to spend some time understanding what all the constraints do and when they're added. Pushed with the types->addFreeze(cx) calls added:

http://hg.mozilla.org/projects/jaegermonkey/rev/8e009db2de15
The patch improved this from 15 seconds to 11-12 seconds on 32-bit. Still a lot of time in stubs::GetElem and the compiler. I need to get a profiling build to get meaningful numbers, but for now I'll focus on the GETELEMs.
Iterate all objects in the type set to determine the typed array type. I also experimented with a new freeze-constraint to trigger recompilation only if an array of a different type is added. It worked fine but I didn't add it to this patch because it added complexity, didn't reduce the number of recompilations on jslinux and you mentioned that you were going to simplify the freeze constraints soon anyway.
Attachment #545905 - Flags: review?(bhackett1024)
Comment on attachment 545905 [details] [diff] [review]
Iterate all objects

Review of attachment 545905 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #545905 - Flags: review?(bhackett1024) → review+
Pushed: http://hg.mozilla.org/projects/jaegermonkey/rev/d7d452b4b90b

Down to 9.7 seconds, 2x slower than FF 5. Only a few calls to stubs::GetElem remain, so we are able to inline them now.

Next problem is reducing the number of recompilations. There's one really huge script of 32,754 bytes (probably about 10,000 ops). We recompile it 260 times. A lot of time in FrameState::sync etc.
(In reply to comment #8)
> We recompile it 260 times.

About a third of these are caused by recompiling for inlining: The script becomes hot and we recompile to inline calls. Then at some point we recompile again and reset the use count. Then the script becomes hot again, etc.

IIUC, reducing the number of recompilations *not* for inlining will also reduce the number of recompilations for inlining because we will reset the use count less often. But it might still be good to limit the recompilations-for-inlining somehow.
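The feedback loop described above can be sketched as a small simulation (all names and thresholds here are made up for illustration; they are not SpiderMonkey's):

```javascript
// Hypothetical model of the use-count feedback loop: a recompile for any
// non-inlining reason resets the use count, so the script can become "hot"
// and trigger a recompile-for-inlining all over again.
const USES_BEFORE_INLINING = 10000; // made-up stand-in for the real threshold

function countInliningRecompiles(totalUses, otherRecompilesAt) {
  let useCount = 0, inlined = false, inliningRecompiles = 0;
  for (let use = 1; use <= totalUses; use++) {
    if (otherRecompilesAt.has(use)) { // e.g. a type-barrier recompile...
      useCount = 0;                   // ...resets the use count
      inlined = false;                // ...and throws away the inlined code
    }
    useCount++;
    if (!inlined && useCount >= USES_BEFORE_INLINING) {
      inliningRecompiles++;           // hot again: recompile to inline calls
      inlined = true;
    }
  }
  return inliningRecompiles;
}
```

With no other recompiles we inline exactly once; every recompile that happens after the script got hot buys another recompile-for-inlining later.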
We probably shouldn't be inlining at all in scripts this large.  The metrics we use for determining when to compile and inline are static, and not affected by either the size of the script or the number of times we have previously compiled it.  They are currently 40 uses and 10000 uses, respectively, but I'm not sure how best to tune them (need to see more scripts that incur a big recompilation cost).  Probably a linear scaling in script length could work (quadratic seems like it would over-penalize big scripts), but it needs numbers for 'a' and 'b' in the 'a*x + b'.  To figure those out it would be cool if these constants (current and future ones) were configurable to make it easy to experiment with different tunings.  e.g. it would be nice to know if making the '40' bigger would improve this situation without needing to recompile the browser.
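A hedged sketch of the linear scaling idea; 'a' and 'b' are the unknown tuning constants from the comment, and the default values below are made up:

```javascript
// Longer scripts must run more often before we pay to (re)compile them.
// The constants are placeholders, purely for experimentation.
function compileThreshold(scriptLength, a = 0.1, b = 40) {
  return Math.ceil(a * scriptLength + b);
}

// The current flat threshold corresponds to a = 0, b = 40:
compileThreshold(32754, 0, 40); // 40 uses, even for the 32 KB jslinux script
compileThreshold(32754);        // 3316 uses with the made-up constants
```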

It would also be really super slick to be able to do partial compilation of scripts --- we see part of a script getting hot (a loop, or a subset of a loop), compile that part but don't compile the rest. Java does this, I think. We'd spend less time in the compiler so we could compile sooner, and we wouldn't trigger recompilation from type changes in code we didn't compile.

From an architecture standpoint, doing this in JM would be pretty simple --- we can already enter jitcode at loop heads from the interpreter, could make a C++ call to reenter the interpreter on leaving a compiled fragment, and more or less compile loop bodies independently from the rest of the script anyways.  I think the requirements on IonMonkey will be similar, not sure yet whether to experiment with this now or make sure we will be able to do this easily in IonMonkey.
(In reply to comment #10)
> From an architecture standpoint, doing this in JM would be pretty simple ---
> we can already enter jitcode at loop heads from the interpreter, could make
> a C++ call to reenter the interpreter on leaving a compiled fragment, and
> more or less compile loop bodies independently from the rest of the script
> anyways.  I think the requirements on IonMonkey will be similar, not sure
> yet whether to experiment with this now or make sure we will be able to do
> this easily in IonMonkey.

This should be easy (in theory) since it has a CFG. We could trim parts of the graph we don't want. Wherever you trim off a branch that could be taken, the removed jump target would change to a bailout.
Whoops, I accidentally answered the IRC thing instead of what you mentioned in this bug. I think compiling loop bodies independently should be possible too, once we have OSR - the bottom of the loop would be a bailout and the top of the loop would have an OSR entry point.
Most of the other recompilations are from stubs::TypeBarrierHelper for GETELEM ops. Two arrays that are used a lot are locals 2 and 20. Local 2 is a dense array:
--
void object[1] #27:15

#27:15 : Array:prototype packed dense {
    (index): [own] unknown
}
--
Local 20 is an Uint8Array:
--
#28:57 : Uint8Array:prototype typed {
    (index): [own] void int
}
--
For these arrays we recompile a lot because either:

1) the type is missing:
--
#155:23310:   9  getlocal 20
  type 0: void object[1] #28:57
#155:23313:   9  localinc 61
  type 0: unknown
#155:23316:   9  getelem
  typeset 1057: missing
  type 0: missing
  barrier: void int
#155:23319:   9  setlocal 9
  type 0: missing
--
After recompiling there's a void barrier:
--
#155:23310:   9  getlocal 20
  type 0: void object[1] #28:57
#155:23313:   9  localinc 61
  type 0: unknown
#155:23316:   9  getelem
  typeset 1057: int
  type 0: int
  barrier: void
#155:23319:   9  setlocal 9
  type 0: int
--
It looks like we repeat this process for many GETELEMs. The same happens for local 2, the dense array; the only difference is that the type there is "unknown" instead of "void or int".

For a typed array, reading out-of-bounds values adds undefined to the element type. It would be really nice if we could add undefined only to the GETELEM where we actually do the out-of-bounds read, and not penalize other GETELEMs. Otherwise, if the element type of a typed array is void/int, we should assume GETELEM reads an int and add a void barrier or something.

2) To make things worse, at some point doubles are added to local 2, the dense array. We have to recompile a number of times to update the type barriers for many GETELEM's.

I'd need to instrument the browser to get numbers but I'm sure that these two are responsible for a large number of compilations.
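The out-of-bounds behavior described above is easy to reproduce in the shell:

```javascript
// A read past the end of a typed array yields undefined, which is how
// "void" gets into the element type set next to "int".
var a = new Uint8Array(4);
a[0] = 255;

var inBounds = a[0];     // 255: an int
var outOfBounds = a[10]; // undefined: adds void to the observed types
```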
Oh btw, I will be mostly offline for a few weeks. So anybody, feel free to steal/close/clone this bug.
Given that typed array lengths are immutable, would it make any sense to include the length as part of the type information of the typed array or something?
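For reference, that invariant is easy to check in the shell: the length of a typed array really is fixed at construction, so it would be stable type information.

```javascript
// A typed array's length never changes after construction; writes to
// .length are ignored (or throw, in strict mode), never honored.
var a = new Uint8Array(8);

try { a.length = 100; } catch (e) { /* throws in strict mode */ }
// a.length is still 8
```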
The 'missing' observed type sets show up when the code has never actually run and produced a value, in which case we are compiling doomed code --- if it ever runs, it will trigger a recompilation.

I think we could be smarter here on typed arrays, since their behavior is so heavily constrained (can't have non-number elements, for example).  Currently the analysis doesn't treat typed arrays any differently from other arrays, they're just arrays of numbers, and for any property access the analysis is paranoid and doesn't want to put types in the observed set unless the code has actually run (the types might change by the time the code does run, and a different type entirely may be the only one observed at the site).

Would rather not do this, though, as this approach doesn't get at the root cause, which is that we shouldn't be compiling code which has never executed.  I kind of want to do partial compilation for JM+TI, as the architecture changes are minor and the main issues are organizational difficulties (where to keep compiled code for loops, how to bail into the interpreter without recompiling, how to trigger recompilation once a bailout gets hot enough) which apply equally to JM+TI and IonMonkey.
Attached file jslinux shell testcase
Here's a shell version of jslinux; I asked for and got permission from Fabrice to post it here. Just unzip and run like this:

$ a/b/js jslinux.js

I "beautified" all files using jsbeautifier.org. I also hacked the terminal code to use putstr so it prints its output nicely. Terminal input does not work, but we don't need it here.

Perf is comparable to the website, maybe a tiny bit faster because the website uses a 10 ms setTimeout and I had to emulate setTimeout using an array and a loop.

I get the following numbers on JM tip:

js -m       :  2.9
js -m -j -p :  3.0
js -m -n    :  9.0
js (-j)     : 28.5
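The setTimeout emulation mentioned above might look roughly like this sketch (hypothetical names, not the actual code in the attachment):

```javascript
// Minimal sketch of a setTimeout shim for a shell with no event loop:
// queue the callbacks and drain the queue in a loop after the main
// script has finished. Delays are ignored; scheduling order is kept.
var timerQueue = [];

function queueTimeout(fn, ms) { // stands in for the page's setTimeout
  timerQueue.push(fn);
}

function drainTimers() {
  // Callbacks may schedule more timers, so loop until the queue is empty.
  while (timerQueue.length > 0)
    timerQueue.shift()();
}
```

jslinux's boot loop then runs by calling drainTimers() once at the end of the script.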
Attached patch "WIP" (Splinter Review)
Patch to avoid compiling code that is known to be unreachable.  This works, and improves the '-m -n' time on my machine from 13s to 10s.  We are still recompiling the huge switch function about 170 times, however, and messing around with the thresholds triggering compilation doesn't help much.  May want a mechanism to compile large scripts one chunk at a time (for some definition of 'chunk') to avoid quadratic recompilation behavior in them.

I'm also not convinced that compilation is the biggest bottleneck here, but don't have much support for this belief, and can't get shark or callgrind or any other perf tool to give me usable data.
You can try just putting rdtsc in js::Compiler's constructor/destructor.
(In reply to Brian Hackett from comment #18)
> I'm also not convinced that compilation is the biggest bottleneck here, but
> don't have much support for this belief, and can't get shark or callgrind or
> any other perf tool to give me usable data.

Here's a small part of the call tree in Instruments (had to trim some lines):
---
5718.0ms   56.5%    js::mjit::TryCompile
5655.0ms   55.8%      js::mjit::Compiler::performCompilation
  48.5% (3.3%)          js::mjit::Compiler::generateMethod()
  11.0% (0.8%)            js::mjit::Compiler::inlineCallHelper
   7.8% (0.2%)            js::mjit::Compiler::jsop_getelem
   5.1% (0.6%)            js::mjit::Compiler::jsop_bitop(JSOp)
309.0ms    3.0%           js::mjit::FrameState::syncForBranch
274.0ms    2.7%           js::mjit::Compiler::jsop_localinc
253.0ms    2.5%           js::mjit::Compiler::startLoop
206.0ms    2.0%           js::mjit::Compiler::jsop_setelem
177.0ms    1.7%           js::mjit::Compiler::jsop_binary
---
And a list of functions sorted by self time:
--
Running (Self)  Symbol Name
590.0ms    5.8%	js::mjit::FrameState::sync
364.0ms    3.5%	js::mjit::Compiler::generateMethod()
255.0ms    2.5%	js::mjit::FrameState::syncForAllocation
242.0ms    2.3%	js::mjit::Compiler::finishThisUp
184.0ms    1.8%	js::mjit::LoopState::analyzeLoopBody
167.0ms    1.6%	js::mjit::StubCompiler::emitStubCall
162.0ms    1.6%	__bzero
159.0ms    1.5%	js_ValueToBoolean(js::Value const&)
155.0ms    1.5%	js::mjit::FrameState::syncAndKill
145.0ms    1.4%	JSC::X86Assembler::X86InstructionFormatter::memoryModRM
144.0ms    1.4%	js::mjit::stubs::FixupArity
142.0ms    1.4%	memmove$VARIANT$sse3x
141.0ms    1.3%	JSC::X86Assembler::X86InstructionFormatter::oneByteOp
--
If I'm reading this right we spend 56.5% under mjit::TryCompile. Not sure about the other 40%; most of it seems to be mjit code (although the mjit code confuses Instruments a bit).

Time under stub calls ("Focus on subtree" seems to be the only way to get this data in Instruments):

stubs::FixupArity:     144 ms
stubs::ValueToBoolean: 135 ms
stubs::LookupSwitch:    44 ms
stubs::Ursh:            27 ms
Depends on: 689284
This has interesting behavior with the patch in bug 706914.  With 1000 byte compilation chunks, compilation time goes way down (4662ms -> 128ms) but boot time goes way up (9.1s -> 13s).  Needs investigation.
Depends on: 706914
The latest patch in bug 706914 fixes some issues that were hurting this benchmark --- compilation wasn't getting triggered properly when jumping between compilation chunks, and the optimized handling of table switches was disabled.  The boot time with that patch is down to 3.5 seconds, which is still slower than plain JM but not grossly so.  I suspect that fixing bug 704387 will help this test a lot.
Depends on: 704387
Nightly 24.0a1 (2013-06-18) - Booted in 5.344s
Google Chrome 27.0.1453.110 m - Booted in 6.599s
Internet Explorer 10 - Booted in 15.081s

Yay!
Yep, the baseline compiler helped this get closer to JM-without-TI. I think we can close this now.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED