Closed Bug 1380033 Opened 4 years ago Closed 4 years ago

Investigate tiered wasm compilation resource management + tiering strategy

Categories

(Core :: JavaScript Engine: JIT, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox58 --- fixed

People

(Reporter: lth, Assigned: lth)

References

Details

Attachments

(2 files, 3 obsolete files)

The scheduler for tier-2 compilation tasks in GlobalHelperThreadState::canStartWasmCompile() only allows one tier-2 compilation helper thread to run at once in normal circumstances (things are a little different when tier-2 compilation is backlogged).  But really we want to allow tier-2 compilation to proceed a little more rapidly than that and that needs careful investigation, because we can't get into a situation where tier-2 compilation bogs down the system; after all the idea of tiering is to improve response time.

Various ideas we've tossed about are:

- static scheduling, like we have now, but with a larger count than '1',
  maybe a function of cpuCount
- dynamic scheduling, allowing tier2 tasks to run on all threads, but perhaps
  with (much) smaller task sizes so that other tasks get time on the
  scheduler quickly

No doubt there are more options.
In another comment (bug 1277562, review of patch 9) Luke also comments re the tier2oversubscribed mechanism that throttles tier-1 and boosts tier-2 compilation:

> IIUC, the BackgroundWorkPossible() check in WasmCompile.cpp is to handle the single-core case (where no helper threads are making progress).  That's pretty unfortunate though since, a single-core machine will benefit the most (in absolute terms) from the baseline and it's likely that it'll have some idle cycles that will eventually allow tier-2 to complete.
>
> I was thinking of an alternative that keeps tiering for single-core: what if GetInitialCompileMode() returned CompileMode::Once if wasmTier2Worklist.length > 20?  That way we have a hard bound on how much we're keeping alive.  With that, even single-core should be fine.

Although single-core systems are a small part of our user population (less than 4% now and falling) it would be nice not to outright disable tiering on these systems.  There's not much we can do if threading is disabled but that's just a testing configuration; we can have threads on single-core systems in general.  So when considering scheduling policies we should look into not having this special case.

Also see ModuleGenerator::startFuncDefs, which has a similar guard.

Also consider the fact that we have guards on the number of compilations that can run in parallel (at most one tier-1 and one tier-2 task), because the helper thread state holds the compilation state; we want to move that state out into the tasks perhaps, so that we're not constrained in that way, but can overlap compilations if there are resources available.
Blocks: 1391196
Blocks: 1391197
No longer blocks: 1277562
Let's use this bug also to investigate when to do tiering, cf Luke's remarks quoted in comment 1.

We know, for example, that on 32-bit systems we might not want to do tiering for very large programs because we'll blow our limited executable memory budget.  That is, except if we can directly serialize tier-2 code that has not been written to executable memory, so that it will benefit from fast compilation on the first load and from fast execution on subsequent loads.

We suspect, also, that we don't want to tier small programs where Ion will be fast enough.  What's "fast enough"?  It depends on the speed of the system relative to the program size, most likely.  As a rough rule of thumb, if program X would compile in less than 1s with Ion then there's no sense in tiering it on a particular device.  The question is whether we can know such data ahead of time, whether we should be building up the knowledge over time, and/or whether we should start with an initial guess and refine it over time.  (And persist this knowledge on the device.)

Another concern is if tier-2 compilation causes OOM in other ongoing work.  And then there's the thread scheduling issue mentioned earlier.

(We also shouldn't tier if the site is going to be blocking progress on the IDB store of compiled code, but I don't know if we can easily detect that.  One hopes this will not happen, but sites that want smooth execution may do it anyway.)
Summary: Investigate tier-2 wasm compilation resource management → Investigate tiered wasm compilation resource management + tiering strategy
(In reply to Lars T Hansen [:lth] from comment #2)
> (We also shouldn't tier if the site is going to be blocking progress on the
> IDB store of compiled code, but I don't know if we can easily detect that. 
> One hopes this will not happen, but sites that want smooth execution may do
> it anyway.)

Hah, interesting point.  Eventually, I think this use case should be satisfied by an explicit "jit"/"aot" flag to compilation so that devs can say what they way.
Another note: when we land this we must also land changes to the testing infrastructure so that in addition to testing everything with --no-wasm-baseline (which we do now) we must test with --no-wasm-ion.  Likely, baseline will be disabled by the scheduler for most of our test code since it's too small to tier.
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Priority: -- → P1
Attached patch bug1380033-tiering-policy.patch (obsolete) — Splinter Review
A WIP, but it responds to changes in clock speed and number of cores, relative to the estimated work size.  It deals with the executable segment size.  And it's not too complicated.

One thing that is not quite right here is that it has to make assumptions on the basis of total module size, whereas we want to make assumptions on the basis of only the code section size.  To do the latter means that the tiering decision can't be made until parsing reaches the code section.  (Alas there's not a simple relationship between bytecode size and code section size, though 0.6-0.8 is typical for larger programs.)
Attached patch bug1380033-tiering-policy.patch (obsolete) — Splinter Review
Not meant to land yet (too many comments for one thing), but this is the form I envision for the first cut: we estimate compile time and don't tier below a fairly generous limit; and on 32-bit we estimate compiled code size and don't tier if we think there's a risk we'll blow the code size budget.

The art lies in the estimates.  Fortunately there are many linear (albeit architecture-dependent) relationships between bytecode size and both compile time and machine code size, see separate spreadsheet, so it's not very complicated.  The trickiest part lies in estimating system speed relative to reference systems.  For now, max cpu speed coupled with system type provide that information, and it's probably good enough.  I aim to run some experiments to validate this.

I expect these computations to be more robust than hard cutoffs by system type, say, once we pin down a somewhat reliable estimator for system speed.
Attachment #8905490 - Attachment is obsolete: true
Attachment #8906635 - Flags: feedback?(luke)
Comment on attachment 8906635 [details] [diff] [review]
bug1380033-tiering-policy.patch

Review of attachment 8906635 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, this makes sense.  It's a fair amount of code given the seeming simplicity of the question (and the temptation to just use a trivial MIN < bytecode size < MAX predicate), but the logic is simple and there's just several important axes to take into consideration here, so I agree what you have here is reasonable.  Nice job distilling this down!

::: js/src/wasm/WasmCompile.cpp
@@ +197,5 @@
> +    // slightly lower max frequency than my new high-end Xeon too, but the Xeon
> +    // can barely keep up, even when tuned to avoid NUMA effects.
> +
> +    const double referenceDesktop = 3500; // Haswell i7-4850HQ, normally 2.3GHz
> +    const double referenceMobile = 2300;  // Tegra Cortex-A15

Of all the numeric constants in this heuristic, these alone stick out as being (1) somewhat mysterious in unit, (2) intimidating if anyone wants to adjust in the future.  That is, for everything else, you can basically run a few wasm workloads and take a simple metrics.  How you would go about measuring and tweaking these numbers is a lot more mysterious.

IIUC, in the overall MillisToIonCompileOnThisSystem() computation, BytecodesPerMs() already takes into account desktop vs. mobile differences, so could RelativeCPUSpeed() ignore desktop vs. mobile and purely be a function of cpu freq?

Secod, what if BytecodePerMs() was defined (and maybe renamed) to return bytecode-per-ms-on-a-high-end-device (of the given desktop/mobile class) and then the returned unit of RelativeCPUSpeed() was: the fraction of throughput of this device compared to a high-end device.  These units I think I could wrap my head around.

Lastly, since everything else is quantized, I was wondering if there was some value in quantizing this fraction as well (into, let's say, 3 buckets: low-end, mainstream, high-end).  This could expand later to take into account cache sizes, etc that we could get from SystemInfo.

@@ +318,5 @@
> +    return Max(1U, uint32_t(codeSize / BytecodesPerMs(cls) / RelativeCPUSpeed(cls)));
> +}
> +
> +// Figure out whether we should use tiered compilation or not.  This can be
> +// slightly expensive and should only be called once other guards are passed.

IIUC, the only expensive thing here is the GetSystemInfoByName() call.  What if instead of a polling based API (which should always return the same value), we have a struct of the relevant fields and JS::SetSystemInfo() passes in the struct (not a function pointer) that is then copied into some process-global.
Attachment #8906635 - Flags: feedback?(luke) → feedback+
(In reply to Luke Wagner [:luke] from comment #7)
> Comment on attachment 8906635 [details] [diff] [review]
> bug1380033-tiering-policy.patch
> 
> Review of attachment 8906635 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/wasm/WasmCompile.cpp
> @@ +197,5 @@
> > +    // slightly lower max frequency than my new high-end Xeon too, but the Xeon
> > +    // can barely keep up, even when tuned to avoid NUMA effects.
> > +
> > +    const double referenceDesktop = 3500; // Haswell i7-4850HQ, normally 2.3GHz
> > +    const double referenceMobile = 2300;  // Tegra Cortex-A15
> 
> Of all the numeric constants in this heuristic, these alone stick out as
> being (1) somewhat mysterious in unit, (2) intimidating if anyone wants to
> adjust in the future.

They are max MHz readings, I've clarified the names to reflect that.  

> That is, for everything else, you can basically run a
> few wasm workloads and take a simple metrics.

Also simple for these - we get them from nsSystemInfo or similar functionality.

However, as we've discussed the raw clock speed is hardly indicative of everything we want to measure.  But I mostly wanted those additional (unknown) factors to be functions of the clock speed readings and other inputs, not baked into these constants.

> IIUC, in the overall MillisToIonCompileOnThisSystem() computation,
> BytecodesPerMs() already takes into account desktop vs. mobile differences,
> so could RelativeCPUSpeed() ignore desktop vs. mobile and purely be a
> function of cpu freq?

I am a little concerned about this because these are different classes of systems and might behave differently across a spectrum of implementations, though as you say at the moment one can clearly be derived from the other.

> Secod, what if BytecodePerMs() was defined (and maybe renamed) to return
> bytecode-per-ms-on-a-high-end-device (of the given desktop/mobile class) and
> then the returned unit of RelativeCPUSpeed() was: the fraction of throughput
> of this device compared to a high-end device.  These units I think I could
> wrap my head around.
> 
> Lastly, since everything else is quantized, I was wondering if there was
> some value in quantizing this fraction as well (into, let's say, 3 buckets:
> low-end, mainstream, high-end).  This could expand later to take into
> account cache sizes, etc that we could get from SystemInfo.

We'll see.  I'll run some experiments with what I have to see what comes of that.

I've introduced another complication that turned out to be necessary, also related to a conversation we had: parallel speedup is not a reality.  So at the moment I use core_count^0.667 as the "effective core count", which lies closer to reality and is fine for devices that firefox runs on.  Here we really do want to discuss hyperthread vs real core, numa vs not, desktop vs mobile memory subsystem, and cache sizes, to get that exponent, or perhaps it ought to be a different kind of function such as a multiplier.

> 
> @@ +318,5 @@
> > +    return Max(1U, uint32_t(codeSize / BytecodesPerMs(cls) / RelativeCPUSpeed(cls)));
> > +}
> > +
> > +// Figure out whether we should use tiered compilation or not.  This can be
> > +// slightly expensive and should only be called once other guards are passed.
> 
> IIUC, the only expensive thing here is the GetSystemInfoByName() call.

For now :)  But point taken.  The comment dates to a time when the computations looked more complicated.

> What
> if instead of a polling based API (which should always return the same
> value), we have a struct of the relevant fields and JS::SetSystemInfo()
> passes in the struct (not a function pointer) that is then copied into some
> process-global.

Yeah, that's a good idea.
Some thoughts at the end of the day, after gathering a fair amount of data (though I want more):

The space estimate is based on solid empirical data and there's no reason we can't use this estimate to cut off tiering on 32-bit systems.  We should continue to refine it to take into account compilations that are in the pipeline, probably, and we can do that precisely because we have data relating code section size to machine code size.

The time estimate is hard to get right in a useful way because it requires a deeper understanding of the system's performance.  Clock speed + estimate of useful concurrency is too crude.  I keep overestimating the Ion time on fast systems with slow clocks (Haswell and recent ARMv7) and underestimating it on slow systems with fast clocks (older AMD).  Without deeper understanding of the system I can tweak for one or the other but not both.  That said, it is satisfying when it gets the estimate right: tiering became enabled when compiling Bullet on my Nexus 5X, as it should, because parallel Ion compilation really does take longer than the cutoff.

Thus, if I had to land a scheduler this week I would probably just use the wasm code section size as a proxy for Ion compilation time and just choose one fixed code section size per system class as a cutoff - if the program is smaller than the limit, don't tier; if it is larger, do.  If a system is way fast we will tier a little more often than we should; if it is way slow we won't tier quite often enough, and make the user wait.  As discussed earlier we can then work on refining our selection of system class, and have a larger selection (along the high-end / midrange / low-end dimension).
(In reply to Lars T Hansen [:lth] from comment #9)
> The space estimate is based on solid empirical data and there's no reason we
> can't use this estimate to cut off tiering on 32-bit systems.  We should
> continue to refine it to take into account compilations that are in the
> pipeline, probably, and we can do that precisely because we have data
> relating code section size to machine code size.

Makes sense, agreed.

> The time estimate is hard to get right in a useful way because it requires a
> deeper understanding of the system's performance.  [...]

Very interesting; it's useful to see the problem really did show up in practice.

> Thus, if I had to land a scheduler this week I would probably just use the
> wasm code section size as a proxy for Ion compilation time and just choose
> one fixed code section size per system class as a cutoff [...]

That makes sense as a way to start; it may be good enough for long time.
Simplified from the previous version as outlined above, and I've set it up to distinguish high / mid / low end systems in Classify() eventually.  Copious comments.

Three matters to discuss:

* Is 250ms a reasonable cutoff?  On the Nexus5X this means that Bullet will
  likely be tier-compiled (being the largest of the embenchen suite) but 
  the normal case for smallish libraries is that they will end up being
  ion-compiled directly.

* I've left the PolicySpew code in the patch (but disabled); I can take it
  out but it will likely bitrot.  Leaving it in, we'll have something to build
  on when revising the policy code over the next couple of months.

* I've left several TODO comments in because I think they're useful to have,
  but they're not tagged with any bug number (it would likely be this bug).

I removed the code that tells the JS engine about the CPU speed but I'll hang onto that patch.  We could generalize it to also tell the JS engine about core count, which we now have separate code for (and separate from Firefox's ditto code).
Attachment #8906635 - Attachment is obsolete: true
Attachment #8908070 - Flags: review?(luke)
Comment on attachment 8908070 [details] [diff] [review]
bug1380033-tiering-policy-v2.patch

Review of attachment 8908070 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like the right heuristic for now, nice work!  r+ with these, mostly nit, comments addressed:

::: js/src/wasm/WasmCompile.cpp
@@ +38,5 @@
> +// code that allows basis data to be measured on a system.
> +
> +// #define WASM_POLICY_SPEW
> +
> +#ifdef WASM_POLICY_SPEW

Could you remove the policy spew from the patch that lands?  (In general, my experience with spew is it tends to be useful for a focused period of development and then never used again so it sits around cluttering.  Plus, when you really need to debug a situation one often wants more intrusive custom spew so you just stuff some printf()s in a custom build.

@@ +260,5 @@
> +// Late-2013 MacBook Pro (2.6GHz quad hyperthreaded Haswell)
> +// Late-2015 Nexus 5X (1.4GHz quad Cortex-A53 + 1.8GHz dual Cortex-A57)
> +//
> +// In late-2017 the MBP is still fairly high-end relative to the consumer device
> +// market, but the Nexus 5X is probably no better than mid-range.

nit: Stating reference devices is useful; without high/mid/low-end being generally considered, the last paragraph doesn't seem necessary though, and even confusing (why use high-end for x64 but mid-range for arm?).

@@ +263,5 @@
> +// In late-2017 the MBP is still fairly high-end relative to the consumer device
> +// market, but the Nexus 5X is probably no better than mid-range.
> +
> +static const double x64BytecodesPerMs = 2100;
> +static const double x64ToX86Slowdown = 1.4;

nit: For symmetry with x64/arm, could we have x86BytecodesPerMs instead of the Slowdown ratio (with following \n removed)?  Because the slowdown ratio here isn't a general ratio of x86 / x64, it's specifically for the task of bytecode/ms which depends on both the processor perf and the volume of code emitted, so the direct empirical number seems more straightforward.

@@ +278,5 @@
> +static const double arm32MobileTierCutoff = arm32BytecodesPerMs * tierCutoffMs;
> +static const double arm64MobileTierCutoff = arm32MobileTierCutoff * x64ToX86Slowdown; // Guess
> +
> +static double
> +CodesizeCutoffOnThisSystem(SystemClass cls, uint32_t codeSize)

nit: given that the function takes SystemClass, the returned value isn't really on "this system" but rather for the given parameter, so I'd remove "OnThisSystem".

@@ +346,5 @@
> +    //
> +    // That said, this is a non-issue: as of September 2017 1-core was down to
> +    // 3.5% of our population and falling.  On Android I've not seen hard data,
> +    // but the cheapest Huaweis and Samsungs available at my local electronics
> +    // megastore all have quad-core CPUs.

nit: could you remove the last sentence and merge the second-to-last sentence with the preceding paragraph?

@@ +354,5 @@
> +        return false;
> +    }
> +
> +    DebugOnly<uint32_t> threadCount = HelperThreadState().threadCount;
> +    MOZ_ASSERT(threadCount >= cpuCount);

nit: can you fold into a single MOZ_ASSERT() line?

@@ +365,5 @@
> +    // scheduler.  That requires a more sophisticated scheduler, though.
> +    //
> +    // TODO: Here we might also worry about the threadCount that is currently
> +    // available, given what is going on elsewhere in the pipeline.  In
> +    // principle maxWasmCompilationThreads can encapsulate that, but it doesn't.

nit: I think the following line speaks for itself without this whole comment block.  We know everything here can be improved, but probably we'll just do it in response to problems, if any, so it doesn't seem like we need to record all the possible future extensions.

@@ +380,5 @@
> +    // Ion compilation on available cores must take long enough to be worth the
> +    // bother.
> +
> +    double cutoffSize = CodesizeCutoffOnThisSystem(cls, codeSize);
> +    double effective_cores = EffectiveCores(cls, cores);

nit: effectiveCores

@@ +402,5 @@
> +    //
> +    // TODO: For now we consider this module in isolation.  We should really
> +    // worry about what else is going on in this process and might be filling up
> +    // the code memory.  It's like we need some kind of code memory reservation
> +    // system.

nit: Could you append to last sentence "or JIT compilation for large modules"?

@@ +447,5 @@
>          return nullptr;
>  
> +    uint32_t codeSize;
> +    if (!d.peekSectionSize(SectionId::Code, &env, "code", &codeSize))
> +        codeSize = 0;

To avoid thinking about "what does it mean if this fails?", how about we make this function fail on return false.  I think that would also simplify the somewhat confusing controlflow of startSection().
Attachment #8908070 - Flags: review?(luke) → review+
(In reply to Luke Wagner [:luke] from comment #12)
> Comment on attachment 8908070 [details] [diff] [review]
> bug1380033-tiering-policy-v2.patch
> 
> Review of attachment 8908070 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> @@ +447,5 @@
> >          return nullptr;
> >  
> > +    uint32_t codeSize;
> > +    if (!d.peekSectionSize(SectionId::Code, &env, "code", &codeSize))
> > +        codeSize = 0;
> 
> To avoid thinking about "what does it mean if this fails?", how about we
> make this function fail on return false.  I think that would also simplify
> the somewhat confusing controlflow of startSection().

That doesn't quite work, notably for modules having no code section (common in the test suite).  I can guard on that and fake the "failed to start code section" message if the decoder is not at the end, but that still breaks a number of tests that expect other error messages.  For example, wasm/binary.js constructs an empty module with a header followed by a data section identifier, but then nothing.  The code section sniffing will find no code but also not end-of-module.  Returning a 'no code section' error here is wrong.
Policy spew code, preserved for posterity.
Addressing all review comments except for what to do if peekSectionSize returns false.

(Can't land until after the soft code freeze ends.)
Attachment #8908070 - Attachment is obsolete: true
Attachment #8909281 - Flags: review+
Blocks: 1402078
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/be26c0a0a56f
Tiering policy with time estimation. r=luke
Followup work directed to bug 1402257 and bug 1402265.
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a414ed3ef9e5
Tiering policy with space proxy. r=luke
https://hg.mozilla.org/mozilla-central/rev/a414ed3ef9e5
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
You need to log in before you can comment on or make changes to this bug.