Created attachment 597946 [details] [diff] [review]
hack that makes games playable
The basic problem we were seeing is that 3 different mandreel-generated games were getting terrible frame rates while Chrome ran fine. With a little instrumentation, it seems that most/all the pauses were recompiling large functions. The generated JS code (unfortunately I cannot post it) is basically a huge number of if/else branches which hit type guards that recompile the huge function. This isn't just a startup problem since new branches are continually being discovered.
I looked into the two biggest sources: in both cases we have type info about global variables and, instead of adding a normal subset relationship, we add a dynamic type barrier that is guaranteed to hit/recompile the first time it is executed. When I switched these to not use the type barrier in the trivial attached patch, most of the recompiles went away and we get Chrome-like speeds. There are still more recompiles happening that jank the game at random times so more investigation is needed. Ideally, Emscripten/Mandreel-generated code wouldn't contain any type guards so I think we need to continue hunting down the sources of recompiles.
In general, though, we should solve this problem not just for Emscripten-generated code. I'm not sure if the fix is "more fine-grained chunked compilation". Instead, what if we just said, for large scripts, that we deoptimize by not using type guards? I hear v8 doesn't use Crankshaft for large scripts. Again, for large Emscripten-generated code, simple local inference should give plenty precise type info. Who knows, with this strategy, perhaps we wouldn't even need chunked compilation?
Are you testing on x86 or x64? Chunk compilation only works on x86 --- issues with far jump repatching make things much more complicated on x64 (will be fixed when IM gets chunk compilation).
If you're seeing compilation time eating up all execution time even on x86, it's a bug in chunk compilation which should be fixed (chunk compilation was designed to fix exactly this situation). Deoptimizing code in large scripts just because they are large is a losing game and we will not be able to hit peak performance on these translated games if we go that route, regardless of whether v8 has the same design problem.
I wasn't aware of the x64 problem. I heard there are problems on x86, too, but I'll double-check.
> designed to fix exactly this situation). Deoptimizing code in large scripts
> just because they are large is a losing game and we will not be able to hit
> peak performance on these translated games if we go that route,
I disagree. Simple local inference for the large translated scripts tells you almost everything you need to know. I now have a large corpus that we can use to test this assumption, though.
So yes x86 completely solves the jitter caused by recompilation. It seems bad to have an optimization that so drastically affects performance but only works on one platform; it effectively breaks the portability of Mandreel/Emscripten, even on desktop. Uniformity certainly would have saved me a few hours of investigation. Are there any concrete plans to add the ARM/x64 version of the optimization?
(In reply to Brian Hackett (:bhackett) from comment #1)
> (will be fixed when IM gets chunk compilation).
There are no plans for IM to have chunked compilation AFAICS. There are non-trivial software engineering concerns.
(In reply to Luke Wagner [:luke] from comment #3)
> There are no plans for IM to have chunked compilation AFAICS. There are
> non-trivial software engineering concerns.
Yeah, not for initial release for sure. I am really interested in this comment though:
> Simple local inference for the large translated scripts tells you almost everything you need to know.
It will be great if we don't have to rearchitect IonBuilder.cpp and various optimization passes.
Yeah, the x64 situation sucks (comment 1 wasn't quite right --- chunked compilation works on ARM). It would be good to fix this for JM, but would basically require backporting the x64 jump mechanism from IM (which is really slick). Maybe not a huge task, I'll poke around.
IM won't have chunked compilation in its initial release, but I'd like to get it in before terribly long. JM chunked compilation was in part done as a prototype for IM, and seems to work fairly well so shouldn't need huge changes going forward. Beyond what was required for JM, the main additional difficulty in IM is the lack of canonical spill locations for locals and stack slots, which could slow down state synchronization when jumping between chunks. I doubt this will be a huge challenge to fix, though.
Also, I agree that local inference gives most information needed to generate fast code, but there is a lot more which full blown type inference will provide. If the goal is not so much 'make autotranslated JS faster' as 'make autotranslated JS as fast as NaCl' then the compiler can't give up on information that would be required to hit that goal. The simple example is hot code in a giant script which does a bunch of HEAP[i] accesses, where HEAP is a global variable with associated bakeinable components (elements address, etc.). Fully optimizing the access requires knowing exactly what HEAP is, which can't be derived from local information. With only local information, LICM and GVN can mitigate the cost of the HEAP accesses, but in scripts as large as complex as Emscripten and Mandreel can generate they will not be able to fully make up for the loss.
I was not suggesting not using type inference! What I was suggesting is not emitting type barriers (i.e. s/addSubsetBarrier/addSubset/) in large scripts. I completely agree with the goal of getting Emscripten-generated code at NaCl speeds. Looking at what we generate now and what we can do that we aren't doing now, I think we can actually get there without a huge amount of effort. Some of it is what you are saying about optimizing the singleton typed array, some of it is optimizing some simple Emscripten-emitted patterns. I was going to file some bugs on Monday and actually try my hand out on IM (after I wrap of the scope/c-p-g stuff).
(In reply to David Anderson [:dvander] from comment #4)
> (In reply to Luke Wagner [:luke] from comment #3)
> > Simple local inference for the large translated scripts tells you almost everything you need to know.
> It will be great if we don't have to rearchitect IonBuilder.cpp and various
> optimization passes.
After having spent some time looking at Emscripten-generated code, it looks like basically all operations flow locally directly from or to a typed array op (or type-forcing ops like "|0" which makes local inference simple and precise. I've got several big (605KLOC) Mandreel games so we can easily measure how much type barriers are improving our precision.
Great fast work on bug 728372!
I'll leave this bug open, repurposed with the less-urgent goal of testing my hypothesis in comments 0 and 7.
Any progress on the testing?
Well, we basically know that, for this well-typed emscripten code, we shouldn't be emitting any barriers. How we achieve that is still under discussion; it is part of a bigger new project for how to deal with these giant generated JS files (startup time, avoiding the choppy frames at the beginning, avoiding all recompilation after the initial compilation, etc). I'm not sure if there is any work left to be done in this bug, though since the problem in comment 0 was fixed.