Closed Bug 1320374 Opened 8 years ago Closed 7 years ago

Baldr: batch compilations

Categories

(Core :: JavaScript Engine: JIT, defect, P1)


Tracking


RESOLVED FIXED
Tracking Status
firefox53 --- affected

People

(Reporter: bbouvier, Assigned: bbouvier)

References

Details

Attachments

(8 files, 6 obsolete files)

2.68 KB, patch — luke: review+
3.95 KB, patch — luke: review+
8.73 KB, patch — luke: review+
22.42 KB, patch — luke: review+
32.31 KB, patch — luke: review+
11.66 KB, patch — luke: review+
7.16 KB, patch — luke: review+
37.50 KB, patch
wasm generators tend to create a lot of very small functions and fewer big ones (in terms of bytecode size). Batching compilation, to avoid launching a task for every single small function, could be highly beneficial, according to some measurements luke made (esp. for the baseline compiler).
A simple renaming that slipped through the net.
Attachment #8814478 - Flags: review?(luke)
Now that the code base has stabilized a bit, it's obvious that MG::numFuncDefs() can be removed and ME::numFuncDefs() used instead. The numFuncImports() one is trickier: asm.js can call it iff we've started generating func defs, which the ME isn't aware of. We could pass that as a boolean parameter to numFuncImports(), but that would be quite a bizarre API.
Attachment #8814479 - Flags: review?(luke)
This moves IonCompileTask* out of the FunctionGenerator, which makes things easier for the batching patch.

It also delays the point at which the FunctionGenerator needs an IonCompileTask (so the entire validation phase can now happen *before* a task is needed). I thought that would be useful and a perf improvement, but on my machine at least, with AngryBots, a FG had to wait for a new task at this point only once, so there's no measurable difference.
Attachment #8814480 - Flags: review?(luke)
Attached patch 4.rename-ioncompiletask.patch (obsolete) — Splinter Review
This renames:
- IonCompileTask to CompileTask, since this is really what it is (it's used for both rabaldr and baldr)
- the former CompileTask (launched by WebAssembly.compile) to CompilePromiseTask, which is a bit wordier but makes it explicit that it's a PromiseTask too.
- InstanciateTask to InstanciatePromiseTask, for symmetry with the latter.
Attachment #8814482 - Flags: review?(luke)
Attached patch 5.batching.patch (obsolete) — Splinter Review
This is a proof of concept after an afternoon of hacking. After it compiled, and after one adjustment to the TempAllocator (which needs to be re-created after the lifo is reset, duh), all the basic wasm/asm.js jit-tests were passing.

It does the batching by bytecode size:
1. every time we want to compile a new function, we add its bytecode length to a running sum for the functions awaiting compilation; once the batching threshold is met, we launch an off-thread compilation.
2. in finishFuncDefs, we manually trigger a compilation for the remaining batched functions.
3. big functions get their own thread.
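
For illustration, here's a minimal sketch of that scheme (hypothetical names and stand-in std:: types, not the actual ModuleGenerator code): functions accumulate in a batch until their summed bytecode size crosses the threshold, big functions bypass the batch entirely, and a final flush handles the leftovers.

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct PendingFunc { uint32_t index; size_t bytecodeLength; };

    class BatchingScheduler {
        std::vector<PendingFunc> batch_;
        size_t batchedBytes_ = 0;
        const size_t batchThreshold_;    // e.g. ~2000 bytes of bytecode
        const size_t bigFuncThreshold_;  // larger functions get their own task

      public:
        BatchingScheduler(size_t batchThreshold, size_t bigFuncThreshold)
          : batchThreshold_(batchThreshold), bigFuncThreshold_(bigFuncThreshold) {}

        // (1) Called for every function definition as it arrives.
        void enqueue(PendingFunc func) {
            if (func.bytecodeLength >= bigFuncThreshold_) {
                launchTask({func});  // (3) big functions compile right away
                return;
            }
            batchedBytes_ += func.bytecodeLength;
            batch_.push_back(func);
            if (batchedBytes_ >= batchThreshold_)
                flush();
        }

        // (2) Called from finishFuncDefs() for whatever is left in the batch.
        void flush() {
            if (batch_.empty())
                return;
            launchTask(std::move(batch_));
            batch_.clear();
            batchedBytes_ = 0;
        }

      private:
        // Stand-in for posting an off-thread compilation of |funcs|.
        void launchTask(std::vector<PendingFunc> funcs) { (void)funcs; }
    };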

I already have a few ideas for making this better:
- for 1, the batching threshold is just an arbitrary choice so far, so it could be tailored to a specific benchmark / set of benchmarks.
- for 3, I use the batching threshold to determine what counts as a "big function", but a separate threshold could probably help too.
- for 2, I think that for the last numTasks functions, we can assign each its own compilation thread, so we don't have to trigger the final batch compilation in finishFuncDefs.

First results are promising, with some outliers that I want to investigate. These are the results on one run on my machine:

http://bnjbvr.github.io/simplegraph/#title=Batching&ytitle=Time%20(in%20ms)&categories=1%20thread%2C2%20threads%2C3%20threads%2C4%20threads%2C5%20threads%2C6%20threads%2C7%20threads%2C8%20threads&values=before-ion%204277%202279%201635%201292%201203%201126%201046%20971%0Abefore-baseline%20638%20532%20487%20511%20540%20538%20536%20520%0Aafter-ion%204162%202333%201662%201244%201144%201052%20959%20898%0Aafter-baseline%20526%20615%20409%20412%20356%20385%20385%20378
(In reply to Benjamin Bouvier [:bbouvier] from comment #4)
> This renames:
> - InstanciateTask to InstanciatePromiseTask, for symmetry with the latter.

Sorry for the nit, but the spelling grates me a little. Any reason we can't stay with "instantiate"?
Attached patch 4.renaming.patch — Splinter Review
(In reply to pipcet from comment #6)
> (In reply to Benjamin Bouvier [:bbouvier] from comment #4)
> > This renames:
> > - InstanciateTask to InstanciatePromiseTask, for symmetry with the latter.
> 
> Sorry for the nit, but the spelling grates me a little. Any reason we can't
> stay with "instantiate"?

Oops, you're obviously right: I've renamed to InstantiatePromiseTask, thanks for the heads-up!
Attachment #8814482 - Attachment is obsolete: true
Attachment #8814482 - Flags: review?(luke)
Attachment #8814555 - Flags: review?(luke)
Attached patch 5.batching.patch (obsolete) — Splinter Review
Here's an updated patch, which I think is in a good state:
- now big functions get prioritized and compiled as they show up. The hard question is to know what's "big", of course.
- the previous patch used two different structures: one for a function compilation's input and one for its result (input + offsets). The two have been merged for simplicity, and the offsets remain dormant until the compilation actually happens (which also allows one more Move).
- the thresholds for the batching size and the big function's size have been promoted to being JitOptions, which makes it much easier to test against different values.

I've done some experiments with the thresholds, and it was quite fun: we have two criteria (batching size + big-function size) and two objectives (minimize Ion's compilation time + minimize baseline's compilation time). Of course the two objectives don't favor the same criteria values, so I've assumed it's more important to make baseline even faster, because Ion can wait in the background.

For what it's worth, in AngryBots, here are some statistics about functions' sizes:
- minimum function size is 3 bytes (call imports).
- maximum is 26413 bytes.
- median function size is 136 bytes.
- 90th percentile is 749 bytes (90% of functions are 749 bytes or smaller).
- 95th percentile is 1232 bytes.

After some exploration of the criteria space, looking at the two compilation modes and all thread counts between 1 and 8, it seems the best behavior is obtained with a batching threshold of around 2000 bytes and a big-function threshold of around 1000 bytes. Of course, these are only heuristics, so this probably overfits AngryBots and my machine, but we have to start somewhere.

Posting up-to-date results in a bit.
Attachment #8814613 - Flags: review?(luke)
And here are the results for compiling AngryBots:

http://bnjbvr.github.io/simplegraph/#title=Batching%20compilation%20(median%20of%20a%20sample%20of%2011%20values)&ytitle=Time%20(in%20ms)&categories=8%20threads%2C7%20threads%2C6%20threads%2C5%20threads%2C4%20threads%2C3%20threads%2C2%20threads%2C1%20thread&values=before-ion%20983.597900390625%20980.6318359375%201048.75%201123.72998046875%201212.06201171875%201298.89697265625%201635.816162109375%202310.3701171875%0Abefore-baseline%20526.58203125%20523.998046875%20524.56298828125%20522.641845703125%20510.5009765625%20504.359130859375%20503.56298828125%20589.76708984375%0Aafter-ion%20894.7919921875%20893.552978515625%20978.14697265625%201063.2919921875%201161.94189453125%201283.053955078125%201623.204833984375%202315.56494140625%0Aafter-baseline%20357.089111328125%20356.578857421875%20360.64990234375%20367.593994140625%20361.572021484375%20351.635009765625%20385.873046875%20521.52197265625

TL;DR:
- up to 33% speedup on baseline compilation
- up to 9% speedup on ion compilation

Some quick analysis:
- a combination of "using the median of 11 tries" and "big functions are compiled right away" removes all the aberrations from the previous set of results (where the patch was resulting in slower compilation, in some cases).
- ion is 5ms slower with a single thread, which one can consider negligible; but baseline compilation is already 10% faster even with a single thread!
- I haven't done the per-section measurements, but the graph suggests that after 3 threads, baseline reaches its peak compilation speed (and actually even slows down); this suggests again that there's some overhead somewhere else, and that this overhead was already there before this set of patches.
Attachment #8814484 - Attachment is obsolete: true
Very nice work.  Can you post a link to a more detailed histogram of function sizes?  What you have above is already very suggestive but I'm curious...
(In reply to Lars T Hansen [:lth] from comment #10)
> Very nice work.  Can you post a link to a more detailed histogram of
> function sizes?  What you have above is already very suggestive but I'm
> curious...

Sure! Sharing the URL is a bit tedious here, so I've exported a histogram chart made with simplegraph:
https://benj.me/pub/angrybots-sizes.png

If you want to do your own statistics, here's the raw data:
https://benj.me/pub/angrybots-sizes.txt
Great work; I'm excited to start reviewing!  It is indeed surprising that baseline maxes out so quickly (but then again, 350ms is a great absolute time).

One possibility is that we're saturating the memory controller; it'd be nice to find the performance counters that measure memory-related stalls and confirm with `perf stat`.  Mobile might have a significantly different CPU/memory speed ratio, so it'd be interesting to see what the numbers look like there too.
(In reply to Luke Wagner [:luke] from comment #12)
> 
> One possibility is that we're saturating the memory controller; 

+1, there should be a large amount of write activity.

(And if it's not that then we could be running into some non-scalable malloc behavior.)
Attachment #8814478 - Flags: review?(luke) → review+
Comment on attachment 8814479 [details] [diff] [review]
2.remove-numfuncdefs.patch

Review of attachment 8814479 [details] [diff] [review]:
-----------------------------------------------------------------

Good catch!  For numFuncImports(), I think the only asm.js uses are from within MG so perhaps you could make MG::numFuncImports() protected and use ME::numFuncImports() for the remaining wasm-only uses.
Attachment #8814479 - Flags: review?(luke) → review+
Comment on attachment 8814480 [details] [diff] [review]
3.move-task-outof-fg.patch

Review of attachment 8814480 [details] [diff] [review]:
-----------------------------------------------------------------

Cool idea; seems like a nice simplification even if it's not a speedup.

::: js/src/wasm/WasmGenerator.cpp
@@ +871,5 @@
>  
> +    if (!freeBytes_.empty()) {
> +        fg->bytes_ = Move(freeBytes_.back());
> +        freeBytes_.popBack();
> +        fg->bytes_->clear();

nit: I wonder if it would be more logical for task->reset() to clear() the recycled bytes it returns...

@@ +904,5 @@
> +        return false;
> +
> +    IonCompileTask* task = freeTasks_.popCopy();
> +
> +    task->init(Move(func), mode);

nit: I'd remove the intervening \n
Attachment #8814480 - Flags: review?(luke) → review+
Comment on attachment 8814555 [details] [diff] [review]
4.renaming.patch

Review of attachment 8814555 [details] [diff] [review]:
-----------------------------------------------------------------

I like the symmetry here; this has nagged me for a while :)  What about if we took it a bit farther and:
 - moved the _declaration_ of wasm::CompileFunction() to vm/HelperThreads.h (using only a forward declaration of wasm::CompileTask, since vm/HelperThreads.cpp doesn't need any details; this removes the #include of WasmIonCompile.h from HelperThreads.cpp)
 - moved everything else (including the _definition_ of wasm::CompileFunction) other than wasm::IonCompileFunction to WasmGenerator.(h,cpp)
This would let Wasm(Ion|Baseline)Compile.(h,cpp) be completely symmetric.
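
In other words, something along these lines in vm/HelperThreads.h (a sketch only; the exact declaration and signature in the actual patch may differ):

    // vm/HelperThreads.h: a forward declaration suffices, so the header no
    // longer needs to #include "wasm/WasmIonCompile.h".
    namespace js {
    namespace wasm {

    class CompileTask;

    bool
    CompileFunction(CompileTask* task);

    } // namespace wasm
    } // namespace js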

::: js/src/wasm/WasmGenerator.cpp
@@ +859,5 @@
>  
> +    if (!freeBytes_.reserve(numTasks))
> +        return false;
> +    for (size_t i = 0; i < numTasks; i++)
> +        freeBytes_.infallibleAppend(js::MakeUnique<Bytes>());

I suppose this logically goes with the previous patch?
Attachment #8814555 - Flags: review?(luke) → review+
Attached patch 5.batch.patch (obsolete) — Splinter Review
Thanks for the reviews! I've addressed all the comments, so uploading a new patch because of all the code motion.
Attachment #8814613 - Attachment is obsolete: true
Attachment #8814613 - Flags: review?(luke)
Attachment #8815243 - Flags: review?(luke)
Pushed by bbouvier@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/78f16019ce10
Rename THREAD_TYPE_ASMJS into THREAD_TYPE_WASM; r=luke
https://hg.mozilla.org/integration/mozilla-inbound/rev/3e47198ee2eb
Remove numFuncDefs() from the ModuleGenerator; r=luke
https://hg.mozilla.org/integration/mozilla-inbound/rev/636dbc0dc256
wasm: move IonCompileTask out of FunctionGenerator; r=luke
https://hg.mozilla.org/integration/mozilla-inbound/rev/40642c7dbe9a
Rename IonCompileTask to CompileTask and {Compile,Instantiate}Task to {1}PromiseTask; r=luke
Given that bug 1317033 comment 1 was the speedup that it was (and it just removes a few mallocs), I was thinking that maybe we could get a small boost by:
 - storing Bytes in FuncBytes by value
 - changing freeBytes_ to a Vector<UniqueFuncBytes>
thus we'd be reusing FuncBytes the same way we reuse Bytes now.
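
A rough sketch of that recycling idea with stand-in std:: types (the real code uses mozilla::Vector/UniquePtr and the existing FuncBytes, so details differ): a pool hands out whole FuncBytes objects, so neither the FuncBytes allocation nor its byte buffer's capacity is thrown away between functions.

    #include <cstdint>
    #include <memory>
    #include <utility>
    #include <vector>

    struct FuncBytes {
        std::vector<uint8_t> bytes;  // stored by value, as suggested above
        uint32_t index = 0;
    };
    using UniqueFuncBytes = std::unique_ptr<FuncBytes>;

    class FuncBytesPool {
        std::vector<UniqueFuncBytes> free_;

      public:
        UniqueFuncBytes acquire() {
            if (free_.empty())
                return std::make_unique<FuncBytes>();
            UniqueFuncBytes fb = std::move(free_.back());
            free_.pop_back();
            fb->bytes.clear();  // keep the capacity, drop the old contents
            return fb;
        }
        void release(UniqueFuncBytes fb) {
            free_.push_back(std::move(fb));
        }
    };
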
See also bug 1321146 for a huge improvement in baseline speed from commenting out the 4 gc:: page-protection calls in PageProtectingVector.h.  This brings my time from 600ms down to 400ms given several threads; excited to see what it does to bbouvier's numbers with that 3-year-newer CPU...
I played some with JIT_OPTION_wasmBatchThreshold and noticed that the current default (2,000) seems basically optimal for Ion, but I was able to get continual improvements up to about 10,000 for baseline, ultimately yielding a 10% speedup.  This makes intuitive sense, given that our unit is bytes of bytecode and Ion does more work per bytecode than baseline.  Perhaps then we should have a limit for each of Ion and baseline, with defaults of 2,000/10,000 respectively?
To wit: with this patch, mprotects commented out, and wasmBatchThreshold=10000, I see a net reduction on my machine from ~800ms to ~350ms for 4 cores and up!
Comment on attachment 8815243 [details] [diff] [review]
5.batch.patch

Review of attachment 8815243 [details] [diff] [review]:
-----------------------------------------------------------------

Overall looks great.  I had a few non-nit suggestions I'd like to discuss before r+.

::: js/src/jit/JitOptions.cpp
@@ +239,5 @@
>  
> +    // Until which wasm bytecode size should we accumulate functions, in order
> +    // to compile efficiently on helper threads. This has been determined by
> +    // trial and errors on AngryBots in November 2016.
> +    SET_DEFAULT(wasmBatchThreshold, 2000);

As suggested in comment 23, I'd add baseline versions of these.  Also, I think it'd be more meaningful to the reader to just link to the bug # rather than mentioning "AngryBots" or the date.

@@ +244,5 @@
> +
> +    // Functions which bytecode is bigger than this should be compiled right
> +    // away and not batched. This is the 90th percentile of function size in
> +    // AngryBots, in November 2016.
> +    SET_DEFAULT(wasmBigFunctionThreshold, 1000);

It's not totally clear to me why we need a special case for big functions: there are usually so many functions that, regardless of whether the current batch is mostly-empty or mostly-full, it should be fine to add a big function to the batch before launching it.  It may lead to a slightly bigger-than-average batch, but it shouldn't matter overall.

Playing with this option on my machine, I actually see a pretty distinct speedup for baseline (from 350ms to 307ms) from simply removing this option completely and no change for Ion.

::: js/src/wasm/WasmBaselineCompile.cpp
@@ +511,5 @@
>      Label                       outOfLinePrologue_;
>      Label                       bodyLabel_;
>      TrapOffset                  prologueTrapOffset_;
>  
> +    FuncOffsets&                offsets_;

What about storing FuncOffsets by value and then copying it into the FuncCompileUnit::finish() (as suggested by the other comment).

::: js/src/wasm/WasmGenerator.cpp
@@ +400,5 @@
>              return false;
>      }
>  
>      // Offset the recorded FuncOffsets by the offset of the function in the
>      // whole module's code segment.

This comment is duplicated below and it doesn't really describe the whole loop, so I'd remove.

@@ +925,5 @@
> +
> +bool
> +ModuleGenerator::launchBatchCompile()
> +{
> +    return launchBatchCompile(Move(batchedFuncs_), &batchedBytecode_);

With the current design, each batch will end up mallocing a new Vector internal array.  Instead, it'd be nice to recycle the Vector that is about to be Move-assigned to in launchBatchCompile() at task->set().  Rather than doing the work to swap Vectors etc., I was thinking more about patch 3 (move-task-outof-fg) and why it didn't provide a speedup: I think the reason is that there are (with bug 1317033 comment 2) 2*maxWasmCompilationThreads tasks, and thus if we would've had to wait in startFuncDef() for a task to complete, that means we already have a bunch of pending CompileTasks, so there's no real advantage to us racing ahead to validate another function.  So if we reverted patch 3, then you could nix MG::batchedFuncs_ and instead reuse fg->task_->units_ (which would get clear()ed after each use but the memory would stay allocated).

@@ +929,5 @@
> +    return launchBatchCompile(Move(batchedFuncs_), &batchedBytecode_);
> +}
> +
> +bool
> +ModuleGenerator::enqueue(FuncCompileUnit&& input)

Since there's only one use, I'd inline into finishFuncDef(); even if it makes the function bigger, it's useful to be able to see all the state changes.

@@ +1204,5 @@
> +            if (!BaselineCompileFunction(task, &unit))
> +                return false;
> +            break;
> +          case CompileMode::None:
> +            MOZ_CRASH("Uninitialized task");

The "None" enumerator is never explicitly used, so I'd remove it.

::: js/src/wasm/WasmGenerator.h
@@ +82,3 @@
>  };
>  
> +// FuncCompileUnit contains all the data necessary to produce and stores the

s/and stores/and store/

@@ +82,5 @@
>  };
>  
> +// FuncCompileUnit contains all the data necessary to produce and stores the
> +// results of a single function's compilation. The produced code offsets remain
> +// dormant until the compilation actually happens.

Perhaps we could remove this last sentence and instead say it in types by making the fields private and having the interface:
  FuncCompileUnit(UniqueFuncBytes, CompileMode);
  const FuncBytes& bytes() const;
  CompileMode compileMode() const;
  void finish(FuncOffsets);
  FuncOffsets funcOffset() const;  // asserts finish() called
  UniqueFuncBytes recycle();  // asserts finish() called
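
Something like the following rough sketch, say (illustrative only; it leans on the existing wasm types, Move, and MOZ_ASSERT, and the member names are guesses, not the reviewed patch):

    class FuncCompileUnit
    {
        UniqueFuncBytes func_;
        CompileMode mode_;
        FuncOffsets offsets_;
        bool finished_;

      public:
        FuncCompileUnit(UniqueFuncBytes func, CompileMode mode)
          : func_(Move(func)), mode_(mode), finished_(false) {}

        const FuncBytes& bytes() const { return *func_; }
        CompileMode compileMode() const { return mode_; }

        void finish(FuncOffsets offsets) {
            MOZ_ASSERT(!finished_);
            offsets_ = offsets;
            finished_ = true;
        }
        FuncOffsets funcOffset() const { MOZ_ASSERT(finished_); return offsets_; }
        UniqueFuncBytes recycle() { MOZ_ASSERT(finished_); return Move(func_); }
    };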

@@ +84,5 @@
> +// FuncCompileUnit contains all the data necessary to produce and stores the
> +// results of a single function's compilation. The produced code offsets remain
> +// dormant until the compilation actually happens.
> +
> +struct FuncCompileUnit

I was thinking maybe FuncCompile*Task* since it's sort of a refinement of a CompileTask and "Unit" is a bit abstract.  WDYT?

@@ +148,5 @@
> +    bool reset(UniqueBytesVector* freeBytes) {
> +        for (FuncCompileUnit& unit : units_) {
> +            if (!freeBytes->emplaceBack(Move(unit.func->recycle())))
> +                return false;
> +            freeBytes->back()->clear();

Perhaps even the clear() call belongs in FuncCompileUnit::recycle() (since it follows from the name).

@@ +229,5 @@
>      MOZ_MUST_USE bool initAsmJS(Metadata* asmJSMetadata);
>      MOZ_MUST_USE bool initWasm();
>  
>      // Functions declarations:
>      uint32_t numFuncImports() const;

pre-existing, but I'd remove this comment and move this method up right below funcCodeRange().
Thanks for the review! A few questions below.

(In reply to Luke Wagner [:luke] from comment #25)
> ::: js/src/jit/JitOptions.cpp
> @@ +239,5 @@
> >  
> > +    // Until which wasm bytecode size should we accumulate functions, in order
> > +    // to compile efficiently on helper threads. This has been determined by
> > +    // trial and errors on AngryBots in November 2016.
> > +    SET_DEFAULT(wasmBatchThreshold, 2000);
> 
> As suggested in comment 23, I'd add baseline versions of these.

It does sound like a good idea to have two thresholds for baseline vs Ion, but there's one issue (mitigated now, but still there in the future). CompileMode is present at the FuncCompileTask level, not at the batched-compilation level, because 1. tiering could require that, within the same batch, some functions get compiled with Ion and others with rabaldr; 2. this can already happen in asm.js mode running with rabaldr: if a function contains SIMD or Atomics, it will still get Ion-compiled.

We could have one batched task for Ion and one for baseline (in which case having two thresholds is easier), maybe? Or, whenever the two compilation modes are present in the same batch, use a function of both thresholds to get a better-suited one?

> 
> @@ +244,5 @@
> > +
> > +    // Functions which bytecode is bigger than this should be compiled right
> > +    // away and not batched. This is the 90th percentile of function size in
> > +    // AngryBots, in November 2016.
> > +    SET_DEFAULT(wasmBigFunctionThreshold, 1000);
> 
> It's not totally clear to me why we need a special case for big functions:
> there are usually so many functions that, regardless of whether the current
> batch is mostly-empty or mostly-full, it should be fine to add a big
> function to the batch before launching it.  It may lead to a slightly
> bigger-than-average batch, but it shouldn't matter overall.
> 
> Playing with this option on my machine, I actually see a pretty distinct
> speedup for baseline (from 350ms to 307ms) from simply removing this option
> completely and no change for Ion.

Ha, strange, I remember observing some small speedups for Ion with this (a few %, really). I assumed that in the worst case, the batch is almost full and adding a very big function could double the batch code size, if not worse (e.g. the threshold is 1000, the batch is at 999, and a new function of size N > 1000 comes in). That's maybe OK for rabaldr, but for Ion it seems we would have to run 1. the compilation of a very big function (with the quadratic behavior of regalloc, etc.) and 2. the compilation of many other smaller functions. This is just intuition (no data here), but it would be nice to confirm in some way.

Do you see any observable changes when using a big-function threshold of at least half the batch threshold, if not more?

> ::: js/src/wasm/WasmGenerator.cpp
> @@ +400,5 @@
> >              return false;
> >      }
> >  
> >      // Offset the recorded FuncOffsets by the offset of the function in the
> >      // whole module's code segment.
> 
> This comment is duplicated below and it doesn't really describe the whole
> loop, so I'd remove.

Ha, I've overlooked my rebasing...

> @@ +84,5 @@
> > +// FuncCompileUnit contains all the data necessary to produce and stores the
> > +// results of a single function's compilation. The produced code offsets remain
> > +// dormant until the compilation actually happens.
> > +
> > +struct FuncCompileUnit
> 
> I was thinking maybe FuncCompile*Task* since it's sortof a refinement of a
> CompileTask and "Unit" is a big abstract.  WDYT?

Yeah, I thought of that too, and it seemed to me that "Task" would refer to data that is handed to the helper threads; now it also makes sense since it's a sub-attribute of a task. Fwiw I've used "unit" by analogy with C++: a cpp file (i.e. a C++ compilation unit, equivalent in our comparison to a FuncCompileUnit) is "batched" into a unified compilation file (~ a task). Unit embodies this notion of atomicity, but it is maybe too abstract indeed.
(In reply to Benjamin Bouvier [:bbouvier] from comment #26)
> > As suggested in comment 23, I'd add baseline versions of these.
> 
> It does sound like a good idea to have two thresholds for baseline vs ion,
> but there's one issue though (mitigated now, but still there in the future).
[...]

Oh hah, good point.  How about this: we have a single threshold constant but we make Ion bytecodes "cost" more by simply multiplying their bytecode size by a second scaling constant (which represents the ratio of baseline:ion compile time), starting at, say, 5.
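
Concretely, the idea might look something like this (illustrative constants and helper, not actual patch code; it assumes the existing CompileMode enum):

    // A function's contribution toward the single batch threshold is its bytecode
    // size scaled by how expensive its compiler is.
    static const size_t IonScaleFactor = 5;  // rough baseline:Ion compile-time ratio

    static size_t
    CompileCost(size_t bytecodeLength, CompileMode mode)
    {
        return mode == CompileMode::Ion ? bytecodeLength * IonScaleFactor
                                        : bytecodeLength;
    }

    // In the generator, accumulate CompileCost(length, mode) instead of the raw
    // bytecode length, and flush the batch once the sum reaches the threshold.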

> few %, really). I assumed in the worst case, the batch is almost full and
> adding a very big function could double the batch code size, if not worse (a
> case where the threshold is 1000, it's at 999 before and a new function of
> size N > 1000 comes in).

I think that would only be a problem if we had very big batches.  But with the current smallish batches, either:
 - the "big" function is close to batch size, in which case it's not a huge difference and the whole batch won't take very long so it's unlikely to be on the critical path
 - the "big" function is much bigger than batch size, in which case the addition of the batch is negligible

> Do you see any observable changes when using a big-function-threshold at
> least half the batch-threshold, if not more?

I probed a variety of constants and wasn't ever able to see a significant improvement.

> Yeah I thought of that too and it seemed to me that "Task" would refer to
> data that is handed in to the helper threads; now it also makes sense since
> it's a sub-attribute of a task. Fwiw I've used "unit" with symmetry with
> C++: a cpp file (i.e. a C++ compilation unit, in our comparison equivalent
> to a FuncCompileUnit) is "batched" into an unified compilation file (~ a
> task). Unit embodies this notion of atomicity, but it is maybe too abstract
> indeed.

The "translation unit" is a reasonable metaphor.  I'm happy either way.
I'll mark this P1, since it seems this is a nice win, Benjamin is driving it, and it looks like it will get into the next release. Feel free to mark P2/P3 if that is not the case.
Priority: -- → P1
Attachment #8815243 - Flags: review?(luke)
Attached patch 5.batch.patch — Splinter Review
Rebased and updated patch.

Some smaller cleanup patches coming up (one for comment 20 and one for reverting patch 3).

Experimentation showed that the best Ion scale factor on my machine is actually 10, not 5. With a batching threshold of 10000 bytecode units and an Ion scale factor of 10, I get the following absolute results when compiling AngryBots with 8 threads:

- in ion: 883ms
- in baseline: 187ms

Detailed results coming in a separate comment.
Attachment #8815243 - Attachment is obsolete: true
Attachment #8822060 - Flags: review?(luke)
This stores Bytes by value in FuncBytes, and moves more fields from FunctionGenerator directly into FuncBytes, by adding more setters in FuncBytes. This is prettier and slightly lowers the amount of churn.
Attachment #8822061 - Flags: review?(luke)
Attached patch 7.reuse-units.patch (obsolete) — Splinter Review
And this is the patch reverting patch 3. (I made it a separate patch because I thought it would be more trouble, but it's actually pretty easy.)
Attachment #8822062 - Flags: review?(luke)
Attached patch folded.patch (obsolete) — Splinter Review
Folded patch (might be easier to review, considering patches 6 and 7 change code that's in patch 5).
Here are the new detailed results for compilation speed:

http://bnjbvr.github.io/simplegraph/#title=Angry%20Bots%20batch%20effect&ytitle=Compilation%20time%20(ms)&categories=8%20threads%2C7%20threads%2C6%20threads%2C5%20threads%2C4%20threads%2C3%20threads%2C2%20threads%2C1%20thread&values=ion-before%20969.614990234375%20992.359130859375%201068.044921875%201196.0068359375%201272.85986328125%201509.02587890625%202203.779052734375%202363.5859375%0Aion-after%20883.4150390625%20922.944091796875%201005.19091796875%201111.296875%201222.7080078125%201459.281982421875%202149.27490234375%202226.699951171875%0Abaseline-before%20397.903076171875%20393.489990234375%20386.0869140625%20369.470947265625%20352.9228515625%20346.7900390625%20374.843994140625%20467.1640625%0Abaseline-after%20187.64599609375%20185.8720703125%20186.15283203125%20183.596923828125%20182.9541015625%20190.320068359375%20246.014892578125%20403.81494140625

- 11 repetitions for each configuration, taking the median as a stable measure. For N threads, the command line to run the benchmark was `taskset -c 0,{N - 1} $js --thread-count={N}`. This is measured on an Intel i7-6700K CPU clocked at 4 GHz.
- note that between comment 9 and now, a month has passed and some compilation-time improvements have already landed (viz. PageProtectingVector), so it's not an apples-to-apples comparison with the results we had before.
- up to 9% speedup with Ion, up to 52% speedup with rabaldr \o/
- for the baseline compiler, the results (both before and after the patch) follow the trend with the aberration after 3 threads: faster and faster up to 3 threads, then slowing down a bit. It's less observable in the case of the batched compilation because it's so fast, but pointing at the values shows that the trend is still there. Fortunately, dmajor has done some great observations and perf measurements recently :)
One tiny simplification for this patch: we actually don't even need to store the CompileTask in the FunctionGenerator, for extra memory savings.
Attachment #8822062 - Attachment is obsolete: true
Attachment #8822062 - Flags: review?(luke)
Attachment #8822067 - Flags: review?(luke)
Attached patch folded.patch — Splinter Review
Attachment #8822063 - Attachment is obsolete: true
Comment on attachment 8822068 [details] [diff] [review]
folded.patch

Review of attachment 8822068 [details] [diff] [review]:
-----------------------------------------------------------------

(I just reviewed the folded patch)  All r+, only two tiny nits; very nice work here!

::: js/src/wasm/WasmGenerator.cpp
@@ +876,2 @@
>              return false;
> +        freeFuncBytes_.infallibleAppend(Move(funcBytes));

It seems to me that we don't have to create any FuncBytes in startFuncDefs(), and in some cases (small modules) there might be a small win from doing it lazily (in startFuncDef).  (This is different from tasks_/freeTasks_, which have a fixed count and must be allocated up-front.)

::: js/src/wasm/WasmGenerator.h
@@ +312,5 @@
>  
>  // A FunctionGenerator encapsulates the generation of a single function body.
> +// ModuleGenerator::startFuncDef must be called after construction and before
> +// doing anything else.
> +// After the body is complete, ModuleGenerator::finishFuncDef must be called

I'd either keep the comment as-is or put a "// \n" to separate the paragraphs (and of the two I'd do the former).
Attachment #8822060 - Flags: review?(luke) → review+
Attachment #8822061 - Flags: review?(luke) → review+
Attachment #8822067 - Flags: review?(luke) → review+
(In reply to Benjamin Bouvier [:bbouvier] from comment #29)
> I get the following absolute results when compiling
> AngryBots with 8 threads:
> 
> - in ion: 883ms
> - in baseline: 187ms

And just because it's fun to see How Low Can You Go (on fast new hardware), what numbers do you see if you switch the PageProtectingVector in jit/x86-shared/Assembler-x86-shared.h to mozilla::Vector (as if defined RELEASE_OR_BETA)?
(In reply to Luke Wagner [:luke] from comment #37)
> (In reply to Benjamin Bouvier [:bbouvier] from comment #29)
> > I get the following absolute results when compiling
> > AngryBots with 8 threads:
> > 
> > - in ion: 883ms
> > - in baseline: 187ms
> 
> And just because it's fun to see How Low Can You Go (on fast new hardware),
> what numbers do you see if you switch the PageProtectingVector in
> jit/x86-shared/Assembler-x86-shared.h to mozilla::Vector (as if defined
> RELEASE_OR_BETA)?

- ion: 902ms -> 875ms
- baseline: 183ms -> 147ms
Some update here: I was testing on mobile, and it seems the Ion scaling ratio is too high for mobile (I get a slowdown for Ion compilation after the patch). I'm investigating how to run the JS shell there (I've run into issues like [0] and already bricked my phone) or how to compile the whole Android browser on my machine, so as to test different scaling constants.

- ion: 9498ms -> 11251ms (~18% slowdown)
- baseline: 4923ms -> 2052ms (~58% speedup)

[0] http://forum.xda-developers.com/google-nexus-5/development/fix-bypassing-pie-security-check-t2797731
Re comment 38 woohoo!
Re comment 39 makes sense to lower the ratio; since Ion didn't require as much batching to begin with, it seems appropriate to be conservative with this ratio.
After trying for a day to compile the APK package and the JS shell for ARM (and hitting unknown compilation failures for the APK, plus segfaults at startup before main for the JS shell), I've decided to give up and just use try with several constants to find better ones.

The results for compiling AngryBots with Ion are very odd:
- ion scale factor of 5: 11065ms
- 6: 7740ms
- 7: 10931ms
- 8: 7603ms
- 9: 5918ms
- 10 (as reported yesterday): 11251ms
- no patch (as reported yesterday): 9498ms

So I've used 9 temporarily, but it would make much more sense to investigate what is going on here: I wonder if the non-linear trend we're observing here is due to some very big functions being added to the batch and slowing it down, or something else entirely. The results have been very volatile on my phone (not the ion compilation time, but the validation time can vary up to 20%), so it just might be the CPU governor that needs to be set, or whatnot.

Also, it might be of interest to try these heuristics on other benchmarks (after all, this also affects asm.js compilation), and to try different heuristics for the batching threshold on mobile too, and on different devices. But that will be in a follow-up.
(forgot to remove leave-open)
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Keywords: leave-open
Resolution: --- → FIXED