eagerize and parallelize analysis and compilation stack

RESOLVED WORKSFORME

Status

()

defect
RESOLVED WORKSFORME
7 years ago
6 years ago

People

(Reporter: bhackett, Assigned: bhackett)

Tracking

(Blocks 2 bugs)

Other Branch
x86
macOS
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [js:t] [games:p2])

Attachments

(1 attachment, 3 obsolete attachments)

(Assignee)

Description

7 years ago
Summary of part of a long discussion with Luke.

Pages with lots of autotranslated JS (i.e. games) have a poor loading experience.  After spending time downloading and parsing the JS, we then go through long compilation stalls in the initial frames, and the page gets choppy whenever new code is encountered and compiled.  This choppiness is generally due to the need for TI to observe code running before types are added for property reads and function arguments.

This could be avoided by selectively not using type barriers, which would allow type information for code to be generated before it has even executed.  Doing this in general would severely degrade analysis precision, but done judiciously on code that behaves monomorphically (e.g. autotranslated JS) this would cut recompilation costs and allow code to be more eagerly compiled.

Doing compilation eagerly is also in general fraught with peril, depending on how much cold code the page contains.  Ideally these pages have been tuned enough to not contain lots of unused stuff, but doing that tuning blind would be tricky so I doubt this is the case in general.  Could have a meet-in-the-middle approach that does some compilation eagerly but not all (vulnerable to all the above sets of problems!) or add a way for the page to indicate functions that should be compiled eagerly (Erk).  Don't know.

In any case, if we do analysis/compilation eagerly then the next step is to do it in parallel, at a script granularity, utilizing all available cores.  The compilation stack would look like:

Serial:

Parse
Emit bytecode

Parallel:

Analyze bytecode, lifetimes, SSA
Analyze types
Compile

The first three analysis stages are already script local and need no synchronization.  TI is an interprocedural analysis and generates a global constraint graph, so parallelizing it looks daunting to start.  I think an approach of generating and solving script-local constraints in parallel, then hooking up cross-script/heap constraints and solving them serially would work pretty well and is simple.  Compilation is script-local, except for invalidation constraints it adds and a few places it can query properties on objects (watch out this doesn't make shape tables!).
(Assignee)

Updated

7 years ago
Assignee: general → bhackett1024
(In reply to Brian Hackett (:bhackett) from comment #0)
> Could have a
> meet-in-the-middle approach that does some compilation eagerly but not all
> (vulnerable to all the above sets of problems!) or add a way for the page to
> indicate functions that should be compiled eagerly (Erk).  Don't know.

I'm not sure if this is a real problem for the type of code that will trigger this eager compilation mode.  For one thing, parallel compilation should absorb some of the additional compilation time of any dead code.  For another, dead code penalizes download and parse time so, if there is a significant amount of it, it will already be more than worth any serious app dev's time to strip it.

> Ideally these pages have been
> tuned enough to not contain lots of unused stuff, but doing that tuning
> blind would be tricky so I doubt this is the case in general.

It shouldn't be hard for us to give devs a tool to identify dead code.  In fact, I think they could do it right now using the Debugger API.  Maybe I could entice jimb or someone to write and publish such a script to show off their new creation...

> for invalidation constraints it adds and a few places it can query
> properties on objects (watch out this doesn't make shape tables!).

PJS went the route of adding a runtime flag to avoid hashifying on reads.  With such a flag, we could add asserts in all the places which definitely shouldn't be hit by the parallel compilation path (pinchpoints like RunScript or type monitoring).
(Assignee)

Comment 2

7 years ago
(In reply to Luke Wagner [:luke] from comment #1)
> PJS went the route of adding a runtime flag to avoid hashifying on reads. 
> With such a flag, we could add asserts in all the places which definitely
> shouldn't be hit by the parallel compilation path (pinchpoints like
> RunScript or type monitoring).

For the WIP I'm writing, I'm basically refactoring changing all the analysis and compiler code to not take a JSContext, but take a new AnalysisContext* instead, which only permits thread local writes.  Whenever the analysis or compiler really really needs to access the VM (mainly for allocating GC things), it can get a JSContext* but this will force other threads to join and the rest of that analysis/compilation will be serial.  Not pretty but better I think than just aborting, which would be problematic for inference.
(In reply to Brian Hackett (:bhackett) from comment #2)
The AnalysisContext sounds great since it sounds like it will statically ensure we don't escape into the VM.  Also, serialization is better than aborting, but are there really cases where it is valid for the analysis to escape into the VM (and not a bug)?  If so, can we make an argument why emscripten-type codes would never take this serializing path?
(Assignee)

Comment 4

7 years ago
(In reply to Luke Wagner [:luke] from comment #3)
> (In reply to Brian Hackett (:bhackett) from comment #2)
> The AnalysisContext sounds great since it sounds like it will statically
> ensure we don't escape into the VM.  Also, serialization is better than
> aborting, but are there really cases where it is valid for the analysis to
> escape into the VM (and not a bug)?  If so, can we make an argument why
> emscripten-type codes would never take this serializing path?

I originally wanted to stipulate this, but ran into cases where we were initializing standard classes during inference, e.g. for property accesses on primitives.  Looking again, I now think these can be forced to only happen in a serial context, and unless there are other nasty cases (I don't remember any, but haven't gone through all the code yet) we should be able to avoid needing to do forced joining.

That would just leave GC thing allocations, which can just happen under a lock during parallel analysis/compilation.  The last elephant in the room is what to do when that allocation triggers a GC, as other threads could be blithely attempting to still do analysis/compilation.  I'd like to just disallow this (supersnappy already takes this step), the number of GC things allocated during analysis is puny (fill in lazy type objects for accessed singletons, prototype objects for allocation sites, ...) and trying to do a last ditch GC if one of these fails wouldn't help in many cases anyways, since these stages do so much more allocation from the heap and if those fail will just OOM.
(In reply to Brian Hackett (:bhackett) from comment #4)
Sounds good
Whiteboard: [js:t]
(Assignee)

Comment 6

7 years ago
Posted patch WIP (bfa21a56f646) (obsolete) — Splinter Review
WIP, mostly feature complete but doesn't compile.

I ended up going with forced-serialization after all.  There are still a variety of places the analysis and compiler need to enter the VM to do minor things (not invoking e.g. object callbacks or anything that can call script!), and times when the compiler needs to modify shared state that other threads will otherwise try to access lock free, particularly the set of properties attached to type objects.  These are generally gated on particular opcodes being in the script (e.g. GETPROP/SETPROP for the latter) so coming up with some restrictions that ensure the analysis/compilation will stay parallel should be straightforward.
What's the scope of this bug--analysis in parallel or compilation too? And if compilers, which ones?
(Assignee)

Comment 8

7 years ago
This covers both analysis and JM compilation in parallel.  The compiler just needs minor changes, mainly to use an AnalysisContext instead of JSContext, and it should be simple to extend this to IM.
(Assignee)

Comment 9

7 years ago
Posted patch WIP (bfa21a56f646) (obsolete) — Splinter Review
WIP, seems to work but needs more tuning and experimenting.  Will come back after landing bug 762561.
Attachment #636220 - Attachment is obsolete: true
(Assignee)

Comment 10

7 years ago
(In reply to Brian Hackett (:bhackett) from comment #0)
> Doing compilation eagerly is also in general fraught with peril, depending
> on how much cold code the page contains.  Ideally these pages have been
> tuned enough to not contain lots of unused stuff, but doing that tuning
> blind would be tricky so I doubt this is the case in general.  Could have a
> meet-in-the-middle approach that does some compilation eagerly but not all
> (vulnerable to all the above sets of problems!) or add a way for the page to
> indicate functions that should be compiled eagerly (Erk).  Don't know.

This problem is hitting me on the demo you showed me on thursday, luke.  This page has a lot of JS, 3805 scripts in the compartment and 26MB total of bytecode, but when running it normally we only compile 640 scripts and 7.4MB of bytecode.  I don't know how much of the remainder is called but just cold, not called at all (but could be called in some cases), or truly dead.  The former two cases can't really be filtered out without inducing lags in the cases where the code is dynamically loaded, and in any case I think it's a tall order to require people to edit their code bases down to keep some compiler optimization from wasting time and (especially) memory.

I think that given programs which are mixing cold and warm/hot code together, there should really be a way for the page to indicate ahead of time what should be compiled.  I think an interface with two functions would work OK:

mozDumpCompilationManifest();

Calling this after running the demo / playing the game / whatever dumps a wad of JSON to the console indicating what was compiled.  Developers can copy this manifest into the JS itself, and when shipping the code can call:

mozExecuteCompilationManifest(manifest);

Which will compile all functions specified in the manifest, front loading the work to avoid jitter when new code is discovered later.

The format for the manifest can be pretty simple.  For these typed array heavy programs the manifest just needs the names of the functions to compile and the types of their arguments (in case calls to the function were not compiled), possibly also the types of the return values of callees which were not compiled.

function foo(x);
function bar() { foo(3); }

To compile foo and bar:

[{"name":"foo", "compile":"true", "arguments":[{"name":"x", "type":"int"}]}, {"name":"bar", "compile":"true"}]
(In reply to Brian Hackett (:bhackett) from comment #10)
> The former two cases can't really be filtered out without inducing lags in
> the cases where the code is dynamically loaded, and in any case I think it's
> a tall order to require people to edit their code bases down to keep some
> compiler optimization from wasting time and (especially) memory.

If we are talking about professional gamedevs, I don't think it is.  The JS you are looking at is a minimum viable demo, and I don't think should be taken as representative.  (To get a better idea, I think we should make a stripped version and continue working with that.)  Gamedevs regularly deal with size/memory/speed limits and are used to optimizing these things (including the authors of the the JS in question).  Again, removing dead code helps more than just jit-compile time and thus gamedevs will have motivation to do it regardless.

> I think that given programs which are mixing cold and warm/hot code
> together, there should really be a way for the page to indicate ahead of
> time what should be compiled.  I think an interface with two functions would
> work OK:

I think this goes in the direction of "adding mozilla-specific cheats" which we wanted to avoid if at all possible.  (Yes, this is what I earlier proposed doing (except with hints and types inline), but I have been thoroughly disabused of this idea by dherman and azakai.)
(Assignee)

Comment 12

7 years ago
OK, it sounds like I could use some disabusing.  Do you have ideas for how we should distinguish code that has been space optimized from code that hasn't?  This is a very hard problem, and if we try to fix it with a bunch of heuristics we will be guessing wrong some/much of the time and either be ballooning the browser's memory usage or forcing developers to go through a song and dance to convince the browser to do what they want it to do.

Some problems really do want to be solved using APIs, and I think this is one of them.  Is window.mozRequestAnimationFrame a cheat?  No, and other browsers have parallel non-standard APIs of course.  Other browsers might want help on this page too; on my machine, this demo hangs Chrome.  This is pretty different from what you proposed earlier, using a special analysis mode and tight constraints on the type of code that could be analyzed.  Using an API like this puts no restrictions on the code that can be handled (well, maybe "name your functions") and would be simple for a developer to incorporate regardless of what middle end tools they are using or whether they just hand wrote the program.
(In reply to Brian Hackett (:bhackett) from comment #12)
> Do you have ideas for how
> we should distinguish code that has been space optimized from code that
> hasn't?

My answer was "don't".

I'm actually in favor of eventually adding *some* API.  But if we just ram one in now and make it a precondition for getting *any* of this eager/fast compilation, then I think it'll feel cheat-y and like we circumvented the standards process (especially with the part about declaring formal arg types).  Can we do simple eager-everything compilation now while also proposing to standardize a new API?  Once/if that API gains traction (whatever that means), we can add it and I don't think we'll have wasted any effort.  But, in the meantime we can make forward progress on generated JS that hasn't been specifically optimized for FF.

Note: this wouldn't be a good idea if we believed there was a risk of regressing perf on a number of real sites, but given the pre-conditions for perf fault here (large typed array, enormous script large parts of it cold), it doesn't seem likely.
My 2c, I want to argue that stalls after initial load are much worse than stalls during initial load.

The reason is that initial load can be masked in various ways so it is less noticeable. And the beginnings of games are not the most intensive. Furthermore, in fact in some games a stuttery startup is considered normal, for example in Second Life and similar products (for example https://its.cloudpartytime.com , that's basically Second Life in WebGL) - they throw you into a partial world and then start to fill it up with content.

But stalls later can be very bad, if they happen at the worst possible time, to take Luke's example, when the end of level boss appears.

It seems to me like the stalls later can be fixed much more easily than the stalls at the beginning. Imagine that if we see a web page is using significant % of CPU for several seconds, it assumes the page is running something heavy (which is true). It then fires up a background thread that eagerly compiles uncompiled code, as well as performs more heavy optimizations on previously-compiled code. If the web page stops using a significant amount of CPU, the thread is paused or quit, but if not, it will eventually finish compiling the entire page's JS, preventing stalls later on.

This doesn't need new APIs, and the hint for when to do the background work is the minimal one, but also one that is guaranteed to be sufficient.
(Assignee)

Comment 15

7 years ago
Posted patch patch (735f9847ef01) (obsolete) — Splinter Review
Working patch, using the API hooks outlined in comment 10.  To my eye, noticeably improves the demo's startup behavior.

I'm not going to try to land this until we get some kind of agreement; I'm already swimming upstream on enough projects.

I only know how to optimize the things that are right in front of me.  On this demo, both of you are pushing for an approach that will take us from performing pretty well with some jitter to where we just run low on memory and stall/crash.

Making a change which slightly improves well tuned sites and tanks performance on poorly tuned sites is an unacceptable trade/off.
Attachment #636858 - Attachment is obsolete: true
(In reply to Brian Hackett (:bhackett) from comment #15)
> Created attachment 637912 [details] [diff] [review]
> patch (735f9847ef01)
> 
> I only know how to optimize the things that are right in front of me.  On
> this demo, both of you are pushing for an approach that will take us from
> performing pretty well with some jitter to where we just run low on memory
> and stall/crash.
> 

Fair point, without an API like you proposed there is indeed risk of
bad effects on other cases. I think it's unlikely if we use a
heuristic like using a lot of CPU over enough time, but I could be
wrong, we don't have enough data.

A simpler API might be easier to agree on, perhaps something like

window.requestAOTC(callback)

which asks the browser to do ahead of time compilation
of the entire page, and the callback is called when AOTC
completes. Only websites that know they benefit from eager
compilation would call that, so there would be minimal risk
to the web.

Personally my main concern though is eventual performance,
less about stutter. Luke argues the two are connected, that
with eager compilation we can not only reduce stutter but
also improve eventual performance, but if there is a
simpler approach to better eventual performance without
eager compilation that would be more important in the
short to medium term IMO. Right now we run compiled code
at around 5x slower than native, getting that much lower
is more urgent than reducing stutter IMO. Being able to
run 10 bots instead of 2 bots in BananaBread would be
huge, for example.
(In reply to Brian Hackett (:bhackett) from comment #15)
> On this demo, both of you are pushing for an approach that will take us from
> performing pretty well with some jitter to where we just run low on memory
> and stall/crash.

Care to expound?  What is taking all the memory: jit-code, temporary TI data, retained TI data, something else?  The bytecode is 35MB, where is the 100x blow-up?  Is OOM causing the crash?  When you shark the stall, is it all jit-compilation or TI?  Given that the frontend can scan/parse this 35MB in < 3s, something seems wrong.
(Assignee)

Comment 18

7 years ago
(In reply to Luke Wagner [:luke] from comment #17)
> (In reply to Brian Hackett (:bhackett) from comment #15)
> > On this demo, both of you are pushing for an approach that will take us from
> > performing pretty well with some jitter to where we just run low on memory
> > and stall/crash.
> 
> Care to expound?  What is taking all the memory: jit-code, temporary TI
> data, retained TI data, something else?  The bytecode is 35MB, where is the
> 100x blow-up?  Is OOM causing the crash?  When you shark the stall, is it
> all jit-compilation or TI?  Given that the frontend can scan/parse this 35MB
> in < 3s, something seems wrong.

It's not easy to measure precisely because about:memory doesn't work when the browser is stalling, but it's presumably a mix of all of the above.  On my computer the browser starts paging when it gets above 1.4 GB or so, and with 400MB of data allocated for the typed array and another 100-200MB for other JS and browser structures, it doesn't take much extra to hit the limit.  Maybe things would look better on a better computer, but in general it's a bad idea to spend resources needlessly.
(In reply to Brian Hackett (:bhackett) from comment #18)
> but in general
> it's a bad idea to spend resources needlessly.

Agreed, but I'm trying to understand why we are spending so many resources.  By your numbers, it sounds like you are getting 800MB of something and that sounds like a bug.
Dave will correct me if I'm wrong, but I think the other day he told me that APIs for interacting with the compiler are OK. As a small bikeshed, I think that instead of dumping functions directly in |window|, we should have an object representing the JS compiler: window.compiler.aot(config) similar to window.performance.now.
I'd say I find the idea of a user-facing API appealing, but I'm not sure I've thought through all the consequences. The biggest danger of something like that is that people will abuse and misuse it ("I heard X makes JS faster! X ALL THE THINGS"). An async API is at least less prone to abuse, since it doesn't block the main thread of course and since it imposes a bit of an up-front cost.

Dave
Where is this at?

After thinking about it more, I've started to think of a compiler-control API as a last resort. Mostly it's because the history of performance tuning APIs is not that good, especially on client code distributed to a wide variety of devices. But I'm also doubtful about whether any other browser would ever like such a thing.

Brian, do you know where all the memory is going when it blows up? I'd like to know if it looks like the current direction is able to move forward.
Agreed a compilation API is a last resort. (I'm not sure how other browser vendors would feel about it.) But it seems like some combination of eager analysis (if we can afford it) and a background compilation thread make sense to try first.

Dave
(Assignee)

Updated

7 years ago
Depends on: 772820
(Assignee)

Comment 24

7 years ago
Apologies for the large size of the patch, much of jsanalyze.cpp, jsinfer.cpp and the mjit are just s/JSContext/AnalysisContext/ stuff.  This factors out JSContext so that it is only rarely used during analysis and compilation, and allows these to happen in parallel.

This removes the new hooks from previous patches, and doesn't replace them with anything --- there is no actual way to trigger parallel analysis or compilation.  I want to get this into the tree, though, for a few reasons:

- Mainly, soon I want to start doing IM compilation off thread, and it'd be nice to leverage this infrastructure.  Will file that tonight.

- AnalysisContext is a nice abstraction and pretty cleanly divides analysis/compilation work from the rest of the VM.

- I think this bug is at an impasse and having the basics in the tree will make things easier for people to experiment with heuristics on other pages/demos.
Attachment #637912 - Attachment is obsolete: true
Attachment #642105 - Flags: review?(luke)
(Assignee)

Updated

7 years ago
Blocks: 774253
Comment on attachment 642105 [details] [diff] [review]
patch (30a4cd976842)

I'd be fine reviewing/landing a patch that makes purely aesthetic changes (like s/JSContext/AnalysisContext/), but this patch does a whole lot more than that.  There are still basic unanswered questions (comment 19, asked again in comment 22).  Also, when we do actually decide to add real multi-threaded compilation, it can't just be some giant patchbomb, but something small that we can have several people carefully analyze; as you well know, making previously-single-threaded code run on multiple threads is serious business.
Attachment #642105 - Flags: review?(luke) → review-
Flagging this as P2 for games, since it can have a significant impact on startup time for large emscripten-compiled apps.
Whiteboard: [js:t] → [js:t] [games:p2]
Is this still relevant, given asm.js parser work?
(Assignee)

Comment 28

6 years ago
The analysis stack doesn't really exist anymore, and Ion compilation mostly occurs off thread, so I don't think this bug is relevant.  I'm not sure whether asm.js work is relevant, unless our position is that all games work should be done through asm.js.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.